Showing posts with label software. Show all posts

Thursday, April 2, 2026

lsim and ldraw

Huh. I've never really talked about lsim here. Strange since I'm fairly pleased with it. I made an allusion to it here, but didn't really talk about it. Hmm ... maybe it's because it's a pretty niche project - of no particular use to anybody except me.

HAH! Like that's ever stopped me.

LSIM

Lsim is a hardware logic simulator. You specify a set of devices, like NAND gates, latches, LEDs, and switches, and specify how they connect. Lsim then simulates the circuit. The non-NAND logic devices are simply composites of NAND gates; my goal is to design a simple CPU using only NANDs.
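To illustrate the "composites of NAND gates" idea, here is a minimal C sketch. It is not lsim's code or its input language, and all the function names are made up for this example. It builds NOT, AND, OR, and XOR out of a single two-input NAND, then checks each one against C's built-in operators.

```c
#include <assert.h>
#include <stdbool.h>

/* The only primitive: a two-input NAND gate. */
bool nand(bool a, bool b) { return !(a && b); }

/* Everything below is a composite of NAND gates (hypothetical names). */
bool not_gate(bool a)         { return nand(a, a); }
bool and_gate(bool a, bool b) { return not_gate(nand(a, b)); }
bool or_gate(bool a, bool b)  { return nand(not_gate(a), not_gate(b)); }

/* Classic four-NAND XOR. */
bool xor_gate(bool a, bool b)
{
    bool n = nand(a, b);
    return nand(nand(a, n), nand(b, n));
}

/* Exhaustively compare each composite against C's own operators. */
void check_gates(void)
{
    for (int a = 0; a <= 1; a++) {
        assert(not_gate(a) == !a);
        for (int b = 0; b <= 1; b++) {
            assert(and_gate(a, b) == (a && b));
            assert(or_gate(a, b)  == (a || b));
            assert(xor_gate(a, b) == (a != b));
        }
    }
}
```

Calling check_gates() walks all four input combinations. Latches and anything with feedback need a notion of time, which this stateless sketch deliberately leaves out.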

It's a tool that only I could love. There's no GUI. No blinking lights. No wave forms. It's pure text, both input and output. It's a pain to describe the circuit with the little language I devised, and it's a bigger pain to interpret the output to see if it does what it is supposed to. I'm so proud.

Claude has helped me with lsim, mostly by reviewing it for me and finding bugs. I think I had it write one or two little utility functions (who remembers how to write vararg code?), but 99% of it is mine. The code reviews saved me much debugging time. Thanks Claude!

Some day I might make a blog post about its internals - it has a few interesting aspects - but let's skip that for now.

Anyway, my biggest problem has been interpreting the output of LSIM. I find I need to look at a properly-drawn circuit diagram so that I can visually trace signals and verify that the printout is doing what I want. But drawing logic circuits is hard, and it's even harder to ensure that the diagram matches the circuit given to lsim.

LDRAW

So I started the ldraw project. This is a GUI drawing tool that lets me draw a circuit diagram using the devices that lsim supports. It can then export an lsim input file containing the lsim commands to define the devices and connect them. Now I can create a circuit and know that the lsim commands match the drawing. Saves time and is much less error-prone.

It is NOT a general-purpose tool with a large library of standard parts. It is intended to be used with lsim, so it only supports the lsim components.

A few quick notes:

  • It's JavaScript and CSS living inside a single HTML file, and it runs in the Chrome browser.
  • It was written by Claude.ai. I used the chat interface (Opus 4.6 in "extended thinking" mode).
  • I "vibe coded" it, a term I don't like, but I don't like "hallucination" either. Coiners of new lingo don't come to me for advice.

Regarding the "vibe coding", I don't know Javascript, and I've never learned the libraries or environment of a browser. Sure, I could have learned it - what, maybe a week or two? - but I also have no interest in GUI work. I.e. it would be a chore. This is my hobby; I avoid chore work whenever possible. So I have not reviewed Claude's code.

Claude has. After every significant phase of development, I ask Claude to: "Perform a deep review for bugs, paying special attention to state management and potential opportunities to make the code more maintainable." Even though it just finished coding, it always finds a few things. One time it decreased code length by about 400 lines by replacing identical repeated code with a few helper functions.

And, of course, I've tested it. Given the nature of the program, most bugs show up pretty quickly. 

I'll post a few interesting details about the methodology we used in a different post.

WHY NOT CLAUDE CODE?

An obvious question: why use the chat interface and not Claude Code?

I tried CC for a different project. It failed. I had asked it to take my lsim language and convert it to a netlist that lcapy could use. Seemed like an easy enough project. CC cranked for an hour or two (with me having to be there the whole time to tell it to keep going). It kept getting errors from lcapy, and having to reverse-engineer the lcapy code to understand why. At the end it declared success. I fired up lcapy with the resulting .sch file, and it was complete garbage.

Now maybe this was just a fundamentally hard problem, and the web interface could not have done any better. Or maybe I didn't use it well (it was my first time trying it). But I can tell you this: using CC wasn't particularly fun. I enjoy the back-and-forth that the chatbot gives me. At the risk of over-anthropomorphizing a chatbot, it feels collaborative instead of directive.

It even laughs at my jokes. (Sort of...)


Thursday, August 14, 2025

Simple C REPL

I'm a fan of language REPLs (Read, Evaluate, Print, Loop). These are interactive programs that let you experiment with the language. For example, running the Python REPL lets you enter Python code interactively and get immediate output. No edit/compile/run cycle. REPLs are useful for experimenting with language features, exploring APIs, reproducing bugs, etc.

But I'm a C programmer, and C doesn't have a REPL. And sometimes I just want to explore details of the language, like sign extension rules and implicit type conversions. C just isn't well-suited to having a REPL. 

But since when has "not well-suited" ever stopped me?

Introducing crepl.sh : https://github.com/fordsfords/crepl

The doc is fairly comprehensive (Thanks Claude!), so I'll just show an annotated sample session:

$ ./crepl.sh
C REPL - Enter C statements or expressions
Type !help for commands
c> int x = 1;
c> x                         -- Omit semicolon to auto-print the expression.
i 1 (0x00000001)             -- The leading "i" indicates type, not the variable name
c> unsigned short j = 2      -- A declaration is not a legal expression; no autoprint!
Compilation error, line rejected. Enter '!errs' for details.
c> unsigned short j = 2;     -- Semicolon suppresses autoprint
c> j
us 2 (0x0002)                -- Autoprint knows it's an unsigned short.
c> x+j
i 3 (0x00000003)
c> j=j+1                     -- An assignment statement is an expression.
us 3 (0x0003)
c> char c = -1;
c> c
c -1 (0xff)
c> x = c
i -1 (0xffffffff)                     -- Nice sign extension!
c> int inc(int x) { x++; return x; }; -- Define a function all on a single line.
c> inc(88)
i 89 (0x00000059)
c> x
i -1 (0xffffffff)                     -- Naturally "inc()" has its own local x.
c> int inc_x() { ++x; };              -- New funct. Oops, I forgot to return something.
c> int inc_x() { ++x; return x; };    -- Fix the funct? No, you can't re-define it.
Compilation error, line rejected. Enter '!errs' for details.
c> !vi                                -- This edits the code so far. I deleted inc_x().
c> int inc_x() { ++x; return x; };    -- Now I can define it properly.
c> inc_x()
i 0 (0x00000000)                      -- It treated x as a global. But is it?
c> inc_x()
i 1 (0x00000001)
c> x
i 1 (0x00000001)
c> !help
Commands:
  !help  - Show this help.
  !errs  - Show compilation/runtime errors from last attempt. Note that line numbers refer to the 'crepl_temp.c' file.
  !new   - Clear all accumulated code.
  !list  - Show current accumulated code.
  !vi    - Edit accumulated code in vi.
  !source filename - read input from filename.
  !sh    - start an interactive subshell. Exit shell to return to crepl.
  !quit  - Exit the REPL
Autoprint types handled:
  char, unsigned char, short, unsigned short,
  int, unsigned int, long, unsigned long,
  long long, unsigned long long, float, double
c> !list
Current code:

int x = 1;
x;
unsigned short j = 2;
j;
x+j;
j=j+1;
char c = -1;
c;
x = c;
int inc(int x) { x++; return x; };
inc(88);
x;
;
int inc_x() { ++x; return x; };
inc_x();
inc_x();
x;
c> !quit
Goodbye!

So, is the variable 'x' a global? The inc_x() function incremented it, so it must be, right? (Spoiler, it's not. See this doc for explanation.)
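The sign extension seen in the session above can also be checked in an ordinary compiled program. This is a minimal sketch, not part of crepl, and the function names are hypothetical. One caveat: whether plain "char" is signed is implementation-defined (the session's 0xffffffff result implies it is signed on my platform), so the sketch spells out "signed char" and "unsigned char" to make both cases explicit.

```c
#include <assert.h>
#include <stdio.h>

/* Widening from a signed 8-bit type preserves the value -1,
 * so the upper bits of the int are filled with ones. */
int widen_signed(signed char c) { return c; }

/* Widening from an unsigned 8-bit type preserves the value 255,
 * so the upper bits of the int are zero. */
int widen_unsigned(unsigned char c) { return c; }

/* Print both conversions in the same "value (hex)" style crepl uses. */
void show_widening(void)
{
    int s = widen_signed(-1);
    int u = widen_unsigned(0xff);

    assert(s == -1);
    assert(u == 255);
    printf("signed:   %d (0x%x)\n", s, (unsigned)s);
    printf("unsigned: %d (0x%x)\n", u, (unsigned)u);
}
```

Both inputs have the same 8-bit pattern (0xff); only the source type decides what the extra bits of the int become.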

Friday, December 13, 2024

Some Useful C Modules

 I'm working on a non-trivial bit of C programming, and I decided to externally modularize three parts of it as potentially reusable components:

  1. err - error-handling module.
  2. hmap - hash map module.
  3. cfg - configuration file loader module.
All three are intended to be simple and small. Note that "err" has no external dependencies, "hmap" leverages "err" but includes a copy of it in its repo, and "cfg" leverages both "err" and "hmap", and includes copies of them too. I know having all those copies seems wasteful, but C doesn't have the same kind of dependency and versioning infrastructure that Java has, and including the files makes each repo stand-alone.

Of the three, "err" will probably be the least likely to be reused by anybody but me, but in some ways is the most helpful, IMO. To quote from its doc:

The C language does not have a well-established methodology for APIs to report errors. Java has exceptions, but C does not. The closest thing that C has to a common methodology is the Unix common practice in which a function returns certain valid values that represent success (the values can vary according to the API function), and a certain invalid value for failure. Callers are expected to check the return value for validity, and refer to "errno" (when available) for information about the error.

In my experience, that kind of error reporting methodology is a recipe for unreliable programs that are hard to debug and fix. Thousands of lines of code written that call APIs without checking the return status, or does check but only prints something unhelpful even to the code maintainers.
See also my earlier post, "Error handling: the enemy of readability?". Note that the "err" system described here is NOT the same as described in that earlier post, but you'll see similarities. This current "err" system evolved the right way, by putting it to use in non-trivial coding efforts. It is battle-tested and has proven its worth ... at least to me.
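For readers who haven't lived with it, the Unix convention described in the quote looks roughly like the sketch below. To be clear, this is NOT the "err" module's API; open_or_report() is a hypothetical helper written only to show the check-the-return-value-then-consult-errno pattern, and it assumes a POSIX-style fopen() that sets errno on failure (the C standard itself doesn't promise that).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open a file for reading and, on failure, report
 * what failed, why (from errno), and where in the source it happened. */
FILE *open_or_report(const char *path)
{
    assert(path != NULL);

    errno = 0;
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        /* Save errno first: later library calls may overwrite it. */
        int saved_errno = errno;
        fprintf(stderr, "%s:%d: fopen(\"%s\") failed: %s\n",
                __FILE__, __LINE__, path, strerror(saved_errno));
        errno = saved_errno;  /* leave it intact for the caller */
    }
    return fp;
}
```

Even this small example shows the problem: every call site needs its own check, and the quality of the message depends entirely on the discipline of whoever wrote that check.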

Finally, full disclosure, Claude.ai helped me in many ways. A little bit with the coding itself, but much more so in many other ways. See "Claude as Coder's Assistant" for a longer description of how I use Claude.

Finally, as I alluded, the above three modules are just supporting cast members of a larger effort that I've been working on: lsim - a digital logic simulator. It's a hobby project - no self-respecting hardware engineer would actually use it for real work - but it's been a fun couple of months getting it working! And, as an actual user of the above three modules, it has evolved those modules into something more useful than they were when they started. It's hard to make a good API if you don't eat your own dogfood.

You might notice that lsim is not well-documented (ok, the hardware definition language IS well documented, thanks to Claude.ai). This is because lsim, while the most ambitious of these code bases, is also the least likely to have anything usable or re-usable by anybody but me.

If I'm wrong and you would like to take it for a test drive, let me know and I'll give you a hand.

Claude as Coder's Assistant

 My love affair with Claude.ai continues.

I don't actually use it much for coding. Code is my hobby; I don't want much help doing that. (Although here's an example where I did ask it to write a function: I wanted a variadic function for error reporting with a printf-style interface. I'd written one years ago but couldn't remember how, and I didn't feel like spending 20 minutes re-teaching myself.)

I use Claude for:

  • Code Reviews (it finds bugs so I don't have to!).
  • Writing Doc.
  • Remembering API names ("What's that function that's better to use than atoi()?").
  • Bringing me up to speed on tools (I've just started using VSCode, and Claude has saved me much time).
  • Discussing pros and cons of design decisions. Sometimes it comes up with considerations I didn't think of. Sometimes it's just the process of explaining it that clarifies the design in my own mind.
  • Asking questions about the C standard to improve my code's portability. (Claude knows the standard much better than I do.)
  • Brainstorming naming conventions (sometimes I get stuck trying to think of a good name).
  • Help with warnings when I finally turned on super-picky gcc options.
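As an aside on the atoi() bullet: the usual answer to that question is strtol(), because it can report both "no number here" and "number out of range", which atoi() cannot. Here is a minimal sketch of how it is typically wrapped; parse_int() is a hypothetical name, not something from my repos.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse a base-10 int. Returns true and stores the
 * value in *out only if the whole string is a number that fits in an int. */
bool parse_int(const char *s, int *out)
{
    assert(s != NULL && out != NULL);

    char *end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);

    if (end == s)                   return false;  /* no digits at all */
    if (*end != '\0')               return false;  /* trailing junk, e.g. "12abc" */
    if (errno == ERANGE)            return false;  /* does not fit in a long */
    if (v < INT_MIN || v > INT_MAX) return false;  /* does not fit in an int */

    *out = (int)v;
    return true;
}
```

Note that strtol() skips leading whitespace, so " 42" is accepted while "42 " is rejected by the trailing-junk check.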

I want to go deeper on a few of those points.

Code Reviews

Overall, Claude-based code reviews are helpful. They've pointed out several cut-and-paste errors, where pasted code was incompletely edited. They've pointed out some inconsistencies that I was glad to fix. And made some suggestions for improvement that I've taken. But it also gets false positives (e.g. claiming a buffer overrun risk where there is none); I think some of that comes from "wanting" too hard to find issues and resorting to raising issues that are often raised in code reviews. Also, for a large codebase with multiple C files, I've seen it get confused and simply find fewer things. It finds more things with smaller reviews. So not perfect, but I'm often surprised at the useful things it does find.

I have been impressed at how well it makes assumptions given incomplete code. For example, I have a logic simulator with two main modules; one a language processor and the other the main logic engine. You don't get a complete view of the big picture without seeing both files. But just as a human can infer much from the names of functions that are called and the context in which they are called, Claude was also able to.

One thing it does NOT do well is request additional information. If I were reviewing a module and needed another one in order to evaluate the correctness of some code, I would request access to the other module. Claude just makes do with what it has, making reasonable assumptions (but not identifying those assumptions), and when those assumptions are wrong, so too are its conclusions.

Finally, missing from the review is higher-level discussion of alternate designs. To be honest, that is usually also lacking with human reviews, but at least as a reviewer I could initiate such a discussion. With Claude I don't get much traction on that besides some general platitudes about good design patterns.

Bottom line: while there are some benefits from human review that Claude cannot match, there are some things I think Claude does better, like finding cut-and-paste problems and other things that are pattern-based. I think the two forms complement each other.

Doc

This is an area where Claude kind of blew me away. As an experiment, I took the two main modules of my logic simulator and stripped out all comments. I then asked Claude to reverse-engineer the code and write documentation for the circuit design language I implemented. It did an amazing job; I only made a few minor tweaks to the doc it generated. It was able to infer various intents behind the code with deep understanding. In particular, while one module was primarily focused on the overall language parsing, the other module contained device-dependent interpretation of the I/O terminal identifiers. As an example, I established the convention that normal connections use lower-case, while "not" connections use upper-case. I.e. "q" and "Q" represent "q" and "not q". The only hint for that was a line of code to the effect, "Q = (1 - q);" It generated doc describing the convention.

Not only did it impress me, it also saved me time. I really was able to take the doc and wholesale insert it. Yes, I made some tweaks as I proofread it, but it converted probably two hours of work into ten minutes of work. And while I don't hate writing documentation, for my hobby I would rather code than document, so it really did increase my enjoyment of my hobby.

Tool Help

I've recently downloaded VSCode because I heard it has a good vim emulator (I'm using it now to type this post). And I'm very happy with it. Finally I'm getting the benefits of a good IDE that can do code refactoring for me. Even just being able to click on an error message and have my cursor popping onto the offending source code line is a time saver. However, VSCode is an advanced tool, and it's not always intuitive how to get things done. Claude to the rescue.

I've asked Claude any number of questions about VSC, and while it doesn't get it right 100% of the time, it's doing better than 80%. For example, it created "tasks" for me to run my compile script and my test script. It also helped me create problem matching patterns so that errors generated by my own program will be recognized as errors and produce clickable file:line links. This is a testament to both VSC and to Claude for quickly showing me how to do it. The alternative would be days worth of Stack Overflow Q&A. I've gotten up to speed on VSC in a fraction of the time I could do on my own. And the help has prevented impatience and frustration from leading me to throw up my hands and go back to command-line vim!

Conclusion

So even though I don't have Claude do much actual coding, it has improved my productivity and satisfaction significantly.

And yes, sometimes I just have conversations with it. I have to laugh every time it claims to have fought some of the same coding battles that I describe (no you haven't!), but I play along since it is emulating how another human would likely respond, and sometimes I'm surprised at how well it does with simple water cooler banter. I've even told it that it's the perfect conversational partner - it doesn't have its own agenda and will follow my conversational lead without friction wherever I lead it. It isn't offended if I ignore its final "engagement" question. And it's always complimenting me on my insights ... so much so that I've created a style to tone it down a bit. (But if I'm feeling low, I'll go back to its normal mode of being overly enthusiastic.)

Claude even found a few typos in this post. Thanks Claude!

Friday, August 25, 2023

Visual Studio Code

 I've been doing more coding than usual lately. As a vi user, I've been missing higher-level IDE-like functionality, like:

  • Shows input parameters to functions (without having to open the .h file and search).
  • Finds definitions of functions, variables, and macros.
  • Finds references to same.
  • Quickly jumping to locations of compile errors. (Most IDEs do syntax checking as you type.)
  • Source-level debugging.
There are other functions as well, like code-refactoring, static analysis, and "lint" capabilities, but the above are the biggies in my book.

Anyway, I've used Visual Studio, Eclipse, and JetBrains, and found those higher-level functions helpful. But I hate GUI-style text editors.

I've gotten good at using emacs and vi during my many years of editing source files. It takes time to get good at a text editor - training your fingers to perform common functions lightning fast, remembering common command sequences, etc. I finally settled on vi because it is already installed and ready to use on every Unix system on the planet. And my brain is not good at using vi one day and emacs the next. So I picked vi and got good at it. (I also mostly avoid advanced features that aren't available in plain vanilla vi, although I do like a few of the advanced regular expressions that VIM offers.)

So how do I get IDE-like functionality in a vi-like editor?

I looked at Vim and NeoVIM, both of which claim to have high-quality IDE plugins. And there are lots of dedicated users out there who sing their praises. But I've got a problem with that. I'm looking for a tool, not an ecosystem. If I were a young and hungry pup, I might dive into an ecosystem eagerly and spend months customizing it exactly to my liking. Now I'm a tired old coder who just wants an IDE. I don't want to spend a month just getting the right collection of plugins that work well together.

(BTW, the same thing is true for Emacs. A few years ago, I got into Clojure and temporarily switched back to Emacs. But again, getting the right collection of plugins that work well together was frustratingly elusive. I eventually gave up and switched back to vi.)

Anyway, as a tired old coder, I was about to give up on getting IDE functionality into a vi-like editor, but decided to flip the question around. What about getting vi-like editing into an IDE?

Turns out I'm not the first one to have that idea. Apparently most of the IDEs have vi editing plugins nowadays. This was NOT the case several years ago when I last used an IDE. I used a vi plugin for Eclipse which ... kind of worked, but had enough problems that it wasn't worth using.

That still leaves the question: which IDE to use? Each one has its fan base and I'm sure each one has some feature that it does MUCH better than the others. Since programming is not my primary job, I certainly won't become a power user. Basically, I suspect it hardly matters which one I pick.

I decided to start with Visual Studio Code for a completely silly reason: it has an easy integration with GitHub Copilot. I say it's silly because I don't plan to use Copilot any time soon! For one thing, I don't code enough to justify the $10/month. And for another, coding is my hobby. The small research I've done into Copilot suggests that to get the most out of it, you shift your activities towards less coding and more editing and reviewing. While that might be a good thing for a software company, it's not what I'm looking for in a hobby. But that's a different topic for a different post.

Anyway, I've only been using Visual Studio Code for about 30 minutes, and I'm already reasonably pleased with the vi plugin (but time will tell). And I was especially pleased that it has a special integration with Windows WSL (I'm not sure other IDEs have that). I was able to get one of my C programs compiled and tested. I even inserted a bug and tried debugging, which was mildly successful.


Monday, May 8, 2023

More C learning: Variadic Functions

This happens to me more often than I like to admit: there's a bit of programming magic that I don't understand, and almost never need to use, so I refuse to learn the method behind the magic. And on the rare occasions that I do need to use it, I copy-and-tweak some existing code. I know I'm not alone in this tendency.

The advantage is that I save a little time by not learning the method behind the magic.

The disadvantages are legion. Copy-and-tweak without understanding leads to bugs, some obvious, others not so much. Even the obvious bugs can take more time to track down and fix than it would have taken to just learn the magic in the first place.

Such was the case over the weekend when I wanted to write a printf-like function with added value (prepend a timestamp to the output). I knew that variadic functions existed, complete with the "..." in the formal parameter list and the "va_list", "va_start", etc. But I never learned it well enough to understand what is going on with them. So when I wanted variadic function A to call variadic function B which then calls vprintf, I could not get it working right.

Ugh. Guess I have to learn something.

And guess what. It took almost no time to understand, especially with the help of the comp.lang.c FAQ site. Specifically, Question 15.12: "How can I write a function which takes a variable number of arguments and passes them to some other function (which takes a variable number of arguments)?" Spoiler: you can't. Which makes sense when you think about how parameters are passed to a function. The longer answer: there's a reason for the "leading-v" versions of the printf family of functions. And the magic is not as magical as I imagined. All I needed to do is create my own non-variadic "leading-v" version of my function B which my variadic function A could call, passing in a va_list. See cprt_ts_printf().
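The pattern described above can be sketched as follows. This is NOT the cprt_ts_printf() code; the names ts_vfprintf(), ts_fprintf(), and ts_printf() are hypothetical, and the timestamp format is just one I picked for the example. The point is the shape: one non-variadic "leading-v" worker that takes a va_list, plus thin variadic front ends that forward to it.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <time.h>

/* "Leading-v" worker: takes a va_list, so any variadic function can
 * forward its arguments here. Returns characters written, or negative. */
int ts_vfprintf(FILE *out, const char *fmt, va_list ap)
{
    assert(out != NULL && fmt != NULL);

    char stamp[32];
    time_t now = time(NULL);
    struct tm *tm_now = localtime(&now);  /* note: not thread-safe */
    if (tm_now == NULL ||
        strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", tm_now) == 0) {
        stamp[0] = '\0';  /* fall back to an empty timestamp */
    }

    int n1 = fprintf(out, "[%s] ", stamp);
    if (n1 < 0) return n1;
    int n2 = vfprintf(out, fmt, ap);  /* the standard leading-v printf */
    if (n2 < 0) return n2;
    return n1 + n2;
}

/* Variadic front end: packages "..." into a va_list and delegates. */
int ts_fprintf(FILE *out, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int n = ts_vfprintf(out, fmt, ap);
    va_end(ap);
    return n;
}

/* A second variadic wrapper reuses the same worker instead of trying
 * to call ts_fprintf(), which it has no way to pass "..." along to. */
int ts_printf(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int n = ts_vfprintf(stdout, fmt, ap);
    va_end(ap);
    return n;
}
```

A call like ts_printf("x=%d %s\n", 7, "ok") prints the bracketed timestamp followed by "x=7 ok". Every va_start() is paired with a va_end() in the same function, and the worker never calls either.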

This post is only partly about variadic functions; it's also about the reluctance to learn something new. Why would an engineer do that? I could explain it in terms of schedule pressure and the urge to make visible progress ("stop thinking and start typing!"), but I think there's something deeper going on. Laziness? Fear of the unknown? I don't know, but I wish I didn't suffer from it.

By the way, that comp.lang.c FAQ has a ton of good content. Good thing to browse if you're still writing in C.

Sunday, December 25, 2022

Critical Program Reading (1975) - 16mm Film

 I find this film delightful: Critical Program Reading (1975) - 16mm Film

I would love to know about choices the filmmaker made. The vibe seems very 1960s; was that intentional?

I also didn't know that structured programming methods were that old. I was born in 1957. According to Wikipedia, the concept of "structured programming" was born in those years, although the term was first popularized by Dijkstra in his 1968 open letter "Goto Considered Harmful".

For some reason, I thought the "structured programming wars" were during the mid-to-late 1980s, when the old-school "spaghetti code" techniques were finally being replaced by more modern techniques. I guess I thought this because I clearly remember the "Goto Considered Harmful" Considered Harmful letter, and its replies. But the true war against spaghetti code was pretty much over by then. The battle at that point was not about if we should use descriptive identifier naming, block structure, and simple control flow. It was about whether the abolition of the goto should be absolute.

<rant read="optional">

I also remember feeling insulted by Dijkstra's On a Somewhat Disappointing Correspondence. He said that a competent professional programmer in 1987 should know the theorem of "the bounded linear search" and should be able to derive that theorem and its proof. I could not even read the theorem since I was not familiar with the notation. And none of my colleagues could either. I suspect that a small percentage of professional programmers of the day (and today also) would qualify as competent by Dijkstra's standards.

In retrospect, I do have some sympathy for Dijkstra's opinion. He knew full well that his standards did not match those of the programming profession. That's exactly what he was complaining about. He strongly felt that programmers should be grounded in the science of computer science. He wanted programmers to spend their time proving their algorithms correct, not slavishly (and inadequately) testing them. I suspect he wasn't saying that the programmers of the day were bad or stupid people, but that they were improperly educated and then released into the field prematurely. I suspect he might agree with, "You are not competent, but it's probably not your fault. It's more the fault of the university that gave you a degree and the company that hired you." Part of me wishes that I and the rest of the world were more dedicated to rigor and depth of mastery.

But, of course, we are not. Airline pilots are not trained to design an airplane. House painters can't give you the chemical formulae of their paints. I remember when my wife had cancer, she was advised against using a surgeon who was a highly respected researcher; she should use a doctor who does hundreds of these surgeries per year. You usually want an experienced practitioner, not a theoretician.

Is the same thing true of programmers? Well, I will note that Dijkstra's program uses single-letter variables, a definite no-no in most structured programming. If he had submitted that to me as part of a job application, I doubt I would have hired him. But maybe that's because *I* am not competent. Maybe software would be much better today if we programmers met Dijkstra's standards. But there would be a heck of a lot less software out there, that's for sure. And cynical humor aside, I do rather like having a smart phone with a GPS.

</rant>

Wednesday, May 4, 2022

CC0 vs GPL

I've been writing little bits and pieces of my own code for many years now. And I've been releasing it as CC0 ("public domain"; see below). I've received a bit of criticism for it, and I guess I wanted to talk about it.

I like to write software. And I like it when other people benefit from my software. But I don't write end-user software, so the only people who benefit from my code are other programmers. But that's fine, I like a lot of programmers, so it's all good.

There are different ways I could offer my software. Much open-source software is available under a BSD license, an Apache license, or an MIT license. These differ in ways that are probably important to legal types, but for the most part, they mean that you can use the code for pretty much any purpose as long as you give proper attribution to the original source. So if I write a cool program and use some BSD code, I need to state my usage of that code somewhere in my program's documentation.

So maybe I should do that. After all, if I put in the effort to write the code, shouldn't I get the credit?

Yeah, that and a sawbuck will get me a cup of coffee. I don't think those attributions are worth much more than ego-boosting, and I guess my programmer ego doesn't need that boost.

With the exception of the GNU General Public License (GPL), I don't think most open source ego-boosting licenses buy me anything that I particularly want. And they do introduce a barrier to people using my code. I've seen other people's code that I've wanted but decided not to use because of the attribution requirement. I don't want the attributions cluttering up my documentation, and adding licensing complications to anybody who wants to use my code. (For example, I was using somebody else's getopt module for a while, but realized I wasn't giving proper attribution, so I wrote my own.)

But what about GNU?

The GPL is a different beast. It is intended to be *restrictive*. It puts rules and requirements for the use of the code. It places obligations on the programmers. The stated goal of these restrictions is to promote freedom.

But I don't think that is really the point of GPL. I think the real point of GPL is to let certain programmers feel clean. These are programmers who believe that proprietary software is evil, and by extension, any programmer who supports proprietary software is also evil. So ignoring that I write proprietary software for a living, my CC0 software could provide a small measure of support for other proprietary software companies, making their jobs easier. And that makes me evil. Not Hitler-level evil, but at least a little bit evil.

If I license my code under GPLv3, it will provide the maximum protection possible for my open-source code to not support a proprietary system. And that might let me sleep better at night, knowing that I'm not evil.

Maybe somebody can tell me where I'm wrong on this. Besides letting programmers feel clean, what other benefit does GPL provide that other licenses (including CC0) don't?

I've read through Richard Stallman's "Why Open Source Misses the Point of Free Software" a few times, and he keeps coming back to ethics, the difference between right and wrong. Some quotes:

  • "The free software movement campaigns for freedom for the users of computing; it is a movement for freedom and justice."
  • "These freedoms are vitally important. They are essential, not just for the individual users' sake, but for society as a whole because they promote social solidarity—that is, sharing and cooperation."
  • "For the free software movement, free software is an ethical imperative..."
  • "For the free software movement, however, nonfree software is a social problem..."
I wonder what other things a free software advocate might believe. Is it evil to have secret recipes? Should Coke's secret formula be published? If I take a recipe that somebody puts on youtube and I make an improvement and use the modified recipe to make money, am I evil? What if I give attribution, saying that it was inspired by so-and-so's recipe, but I won't reveal my improvement? Still evil?

How about violin makers that have secret methods to get a good sound? Evil?

I am, by my nature, sympathetic to saying yes to all of those. I want the world to cooperate, not compete. I used to call myself a communist, believing that there should be no private property, and that we should live according to, "From each according to his ability, to each according to his needs". And I guess I still do believe that, in the same way that I believe we should put an end to war, cruelty, apathy, hatred, disease, hunger, and all the other social and cultural evils.

Oh, and entropy. We need to get rid of that too.

But none of them are possible, because ... physics? (That's a different subject for a different day.)

But maybe losing my youthful idealism is nothing to feel good about. Instead of throwing up my hands and saying it's impossible to do all those things, maybe I should pick one of them and do my best to improve the world. Perhaps the free software advocates have done exactly that. They can't take on all the social and cultural ills, so they picked one in which they could make a difference.

But free software? That's the one they decided was worth investing their altruism?

Free software advocates are always quick to point out that they don't mean "free" as in "zero cost". They are referring to freedoms - mostly the freedom to run a modified version of a program, which is a freedom that is meaningless to the vast majority of humanity. I would say that low-cost software is a much more powerful social good. GPL software promotes that, but so do the other popular open-source licenses. (And so does CC0.)

So anyway, I guess I'm not a free software advocate (big surprise). I'll stick with CC0 for my code.

What is CC0

The CC0 license attempts to codify the concept of "public domain". The problem with just saying "public domain" is that the term does not have a universally agreed-upon definition, especially legally. So CC0 is designed to approximate what we think of as public domain.

Thursday, August 5, 2021

Timing Short Durations

I don't have time for a long post (HA!), but I wanted to add a pointer to https://github.com/fordsfords/nstm ("nstm" = "Nano Second Timer"). It's a small repo that provides a nanosecond-precision time stamp portably between MacOS, Linux, and Windows.

Note that I said precision, not resolution. I don't know of an API on Windows that gives nanosecond resolution. The one Microsoft says you should use (QueryPerformanceCounter()) always returns "00" as the last two decimal digits. I.e. it has 100-nanosecond resolution. They warn against using "rdtsc" directly, although I wonder if most of their arguments are no longer applicable. I would love to hear if anybody knows of a Windows method of getting nanosecond resolution timestamps that is reliable and efficient.

One way to measure a short-duration "thing" is to time doing the "thing" a million times (or whatever) and take an average. This approach helps because taking a timestamp itself takes time; i.e. making the measurement changes the thing you are measuring. Amortizing that cost over many iterations minimizes its influence.
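Here is a minimal sketch of that averaging approach. It does not use nstm; it assumes a POSIX system with clock_gettime(), and the names now_ns(), thing(), and avg_ns_per_call() are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Return the current CLOCK_MONOTONIC time in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical "thing" being timed. The volatile counter keeps the
 * compiler from optimizing the work (and the loop) away. */
volatile uint64_t sink;
void thing(void)
{
    sink += 1;
}

/* Do the "thing" iters times between just two timestamps and return
 * the average nanoseconds per call. */
double avg_ns_per_call(uint64_t iters)
{
    uint64_t start = now_ns();
    for (uint64_t i = 0; i < iters; i++) {
        thing();
    }
    uint64_t end = now_ns();
    return (double)(end - start) / (double)iters;
}
```

The cost of the two now_ns() calls is spread across all of the iterations, so the more iterations, the less it distorts the average.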

But sometimes, you just need to directly measure short things. Like if you are histogramming them to get the distribution of variations (jitter).
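For the histogram case, a rough sketch might look like the following. Again this assumes POSIX clock_gettime() rather than nstm, and the power-of-two bucket layout and all of the names are my own assumptions, not something taken from the nstm repo.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical layout: bucket b counts durations in [2^b, 2^(b+1)) ns.
 * Bucket 0 also holds 0 ns; the last bucket holds everything larger. */
#define NUM_BUCKETS 16
uint64_t hist[NUM_BUCKETS];

/* Return the current CLOCK_MONOTONIC time in nanoseconds. */
uint64_t stamp_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Map a duration to its power-of-two bucket, clamping to the last one. */
int bucket_for(uint64_t ns)
{
    int b = 0;
    while (ns > 1 && b < NUM_BUCKETS - 1) {
        ns >>= 1;
        b++;
    }
    return b;
}

/* Take one pair of timestamps per sample and count each duration. */
void histogram_thing(uint64_t samples)
{
    memset(hist, 0, sizeof(hist));
    for (uint64_t i = 0; i < samples; i++) {
        uint64_t t0 = stamp_ns();
        /* the short "thing" being measured would go here */
        uint64_t t1 = stamp_ns();
        hist[bucket_for(t1 - t0)]++;
    }
}
```

With nothing between the two timestamps, the histogram shows the overhead and jitter of the timestamp call itself, which is a useful baseline to have before measuring anything else.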

I put some results here: https://github.com/fordsfords/fordsfords.github.io/wiki/Timing-software


Friday, July 9, 2021

More Perl "grep" performance

In an earlier post, I discovered that a simple Perl program can outperform grep by about double. Today I discovered that some patterns can cause the execution time to balloon tremendously.

I have a new big log file, this time with about 70 million lines. I'm running it on my newly-updated Mac, whose "time" command has slightly different output.

Let's start with this:

time grep 'asdf' cetasfit05.txt
... 39.388 total

time grep.pl 'asdf' cetasfit05.txt
... 21.388 total

About twice as fast.


Now let's change the pattern:

time grep 'XBT|XBM' cetasfit05.txt
... 24.787 total

time grep.pl 'XBT|XBM' cetasfit05.txt
... 18.940 total

Still faster, but nowhere near twice as fast. I don't know why.

Now let's add an anchor:

time grep '^XBT|^XBM' cetasfit05.txt
... 25.580 total

time grep.pl '^XBT|^XBM' cetasfit05.txt
... 3:08.25 total

WHOA! Perl, what happened????? 3 MINUTES???

My only explanation is that Perl tries to implement a very general regular expression algorithm, and grep implements a subset, and that might cause Perl to be slow in some circumstances. For example, maybe the use of alternation with anchors introduces the need for "backtracking" under some circumstances, and maybe grep doesn't support backtracking. In this simple example, backtracking is probably not necessary, but to be general, Perl might do it "just in case". (Note: I'm not a regular expression expert, and don't really know when "backtracking" is needed; I'm speculating without bothering to learn about it.)

Anyway, let's make a small adjustment:

time grep.pl '^(XBT|XBM)' cetasfit05.txt
... 17.910 total

There, that got back to "normal".

I guess multiple anchors in a pattern is a bad idea.


P.S. - even though this post is about Perl, I tried one more test with grep:

time grep 'ASDF' cetasfit05.txt
... 26.132 total

Whaaa...? I tried multiple times, and lower-case 'asdf' always takes about 40 seconds, and upper-case 'ASDF' always takes about 27 seconds. I DON'T UNDERSTAND COMPUTERS!!! (sob)

Friday, November 27, 2020

Sometimes you need eval

The Unix shell usually does a good job of doing what you expect it to do. Writing shell scripts is usually pretty straightforward. Yes, sometimes you can go crazy quoting special characters, but for most simple file maintenance, it's not too bad.

I *think* I've used the "eval" function before today, but I can't remember why. I am confident that I haven't used it more than twice, if that many. But today I was doing something that seemed like it shouldn't be too hard, but I don't think you can do it without "eval".

RSYNC

I want to use "rsync" to synchronize some source files between hosts. But I don't want to transfer object files. So my rsync command looks somewhat like this:

rsync -a --exclude "*.o" my_src_dir/ orion:my_src_dir

The double quotes around "*.o" are necessary because you don't want the shell to expand it, you want the actual string *.o to be passed to rsync, and rsync will do the file globbing. The double quotes prevent file glob expansion. And the shell strips the double quotes from the parameter. So what rsync sees is:

rsync -a --exclude *.o my_src_dir/ orion:my_src_dir

This is what rsync expects, so all is good.

PUT EXCLUDE OPTIONS IN A SYMBOL: FAIL

For various reasons, I wanted to be able to override that exclusion option. So I tried this:

EXCL='--exclude *.o'  # default
... # code that might change EXCL
rsync -a $EXCL my_src_dir/ orion:my_src_dir

But this doesn't work right. The symbol "EXCL" will contain the string "--exclude *.o", but when the shell substitutes it into the rsync line, it then performs file globbing, and the "*.o" gets expanded to a list of files. For example, rsync might see:

rsync -a --exclude a.o b.o c.o my_src_dir/ orion:my_src_dir

The "--exclude" option only expects a single file specification, so only "a.o" is excluded; rsync treats "b.o" and "c.o" as additional source files to transfer.

SECOND TRY: FAIL

So maybe I can enclose $EXCL in double quotes:

rsync -a "$EXCL" my_src_dir/ orion:my_src_dir

This passes "--exclude *.o" as a *single* parameter. But rsync expects "--exclude" and the file spec to be two parameters, so it doesn't work either.

THIRD TRY: FAIL

Finally, maybe I can force quotes inside the EXCL symbol:

EXCL='--exclude "*.o"'  # default
... # code that might change EXCL
rsync -a $EXCL my_src_dir/ orion:my_src_dir

This almost works, but what rsync sees is:

rsync -a --exclude "*.o" my_src_dir/ orion:my_src_dir

It thinks the double quotes are part of the file name, so it won't exclude the intended files.

EVAL TO THE RESCUE

The solution is to use eval:

EXCL='--exclude "*.o"'  # default
... # code that might change EXCL
eval "rsync -a $EXCL my_src_dir/ orion:my_src_dir"

The shell does symbol substitution, so this is what eval sees:

rsync -a --exclude "*.o" my_src_dir/ orion:my_src_dir

And eval will re-process that string, including stripping the double quotes, so this is what rsync sees:

rsync -a --exclude *.o my_src_dir/ orion:my_src_dir

which is exactly correct.

P.S. - if anybody knows of a better way to do this, let me know!

EDIT: The great Sahir (one of my favorite engineers) pointed out a shell feature that I didn't know about:

Did you consider setting noglob? It will prevent the shell from expanding '*'. Something like:

    EXCL='--exclude *.o' # default
    set -o noglob
    rsync -a $EXCL my_src_dir/ orion:my_src_dir
    set +o noglob

I absolutely did not know about noglob! In some ways, I like it better. The goal is to pass the actual star character as a parameter, and symbol substitution is getting in the way. Explicitly setting noglob says, "hey shell, I want to pass a string without you globbing it up." I like code that says exactly what you mean.

One limitation of using noglob is that you might have a command line where you want parts of it not globbed, but other parts globbed. The noglob basically operates on a full line. So you would need to do some additional string building magic to get the right things done at the right time. But the same thing would be true if you were using eval. Bottom line: the shell was made powerful and flexible, but powerful flexible things tend to have subtle corner cases that must be handled in non-obvious ways. No matter what, a comment might be nice.

FULL DISCLOSURE: I tried it and it didn't work as expected. It's probably related to all the crazy quoting. Since my "eval" solution worked, I didn't invest the time to figure out why the "noglob" method didn't work. So I'm still using eval even though noglob is arguably better for this purpose.

Sunday, July 12, 2020

Perl Diamond Operator

As my previous post indicates, I've done some Perl noodling this past week. (I can't believe that was my first Perl post on this blog! I've been a Perl fan for a loooooooong time.)

Anyway, one thing I like about Perl is it takes a use case that tends to be used a lot and adds language support for it. Case in point: the diamond operator "<>" (also called "null filehandle" or "null angle operator").

See the "Tutorial" section below if you are not familiar with the diamond operator.

Tips

I may expand this as time goes on.

Filename / Linenumber

Inside the loop, you can use "$." as the line number and "$ARGV" as the file name of the currently open file.

*BUT*, see next tip.

Continue / Close

Always code your loop as follows:

while (<>) {
  ...
} continue {
  close ARGV if eof;
}

The continue clause is needed to have "$." refer to the line number within the *current* file. Without it "$." will refer to the total number of lines read so far.

In my opinion, even if what you want is total lines and not line within file, you should still code it like the above and just use your own counter for the total line number. This provides consistency of meaning for "$.". Plus, it's possible that in the future you will want to add functionality that requires line within file, and it's messy to code that with your own counter.

Skip Rest of File

Sometimes you get a little ways into a file and you decide that you're done with the file and would like to skip to the next (if any). Include this inside the loop:

close ARGV;  # skip rest of current file

Positional Parameter

Let's say you're writing a perl version of grep, and you want the first positional parameter (after the options) to be the search pattern.

$ grep.pl "ford" *.txt

Unfortunately, this will try to read a file named "ford" as the first file. What to do?

my $pat = shift;  # Removes and returns $ARGV[0].
while (<>) {
  ...

This works because "<>" doesn't actually look at the command line. It looks at the @ARGV array. The "shift" function defaults to operating on the @ARGV array.

Security Warning

Because of the way the diamond operator opens files, it is possible for a hostile user to construct a file that can produce very bad results. For example:

$ echo "hello world" >x
$ echo "goodby world" >'rm x|'
$ ls -1
rm x|
x
$ cat *
goodby world
hello world
$ cat x
hello world

So far, so good. "rm x|" is just an unusually-named file with a space in the middle and a pipe ("|") at the end. But now let's use my perl version of grep with a pattern of "." (matches all non-empty lines):

$ grep.pl "." *
Can't open x: No such file or directory at /home/sford/bin/grep.pl line 81.
$ cat x
cat: x: No such file or directory

Yup, grep.pl just deleted the file named "x". The pipe character at the end of the file name "rm x|" triggered Perl's ability to open a filehandle onto a command (a feature of the 2-argument open, which the diamond operator uses). In other words, by just naming a file in a particular way, you've made grep.pl do something unexpected and potentially dangerous.

This might look like a horrible security hole (what if the name of that rogue file resulted in deleting all your files?), but it can also be a very powerful (albeit rarely used) feature. The moral of the story is don't run *any* tool over a set of files that you aren't familiar with.

Alternatively, you can use "<<>>" instead of "<>". This forces each input file to be opened as a file, not potentially as a command. But it requires Perl version 5.22 or newer, which rarely seems to be on any system I try to use.

Unfortunately, it also prevents the special handling of an input file named "-" to read standard input. This is a construct that I do use periodically.

Tutorial

Many Unix commands have the following semantics:

cmdname -options [input_file [input_file2 ...] ]

where the command will read from each input file sequentially, or from standard input if no input files are provided. File names can be wildcarded. Most such Unix commands allow you to supply "-" as a file name and the tool will read from standard input.

The diamond operator makes this ridiculously easy. Here's a minimal "cat" command in Perl:

#!/usr/bin/env perl
while (<>) {
  print $_;
}


That's the whole thing. It takes zero or more input files (if none, reads from standard input) and concatenates them to standard out. Just like "cat".

Specifically what "<>" does is read one line from whatever input file is currently open. If it is at the end of the file, "<>" will automatically open the next file (if any) and read a line from it. As with many Perl built-ins, it leaves the just-read line in the "$_" variable.

You should be ready for the "Tips" section now.

Saturday, July 11, 2020

Perl Faster than Grep

So, I've been crawling through a debug log file that is 195 million lines long. I've been using a lot of "grep | wc" to count numbers of various log messages. Here are some timings for my MacBook Pro:

$ time cat dbglog.txt >/dev/null
real 0m35.423s

$ time wc dbglog.txt
195177935 1177117603 28533284864 dbglog.txt
real 1m44.560s

$ time egrep '999999' dbglog.txt
real 7m39.737s

(For this timing, I chose a pattern that would *NOT* be found.)

On the MacBook, the man page for fgrep claims that it is faster than grep. Let's see:

$ time fgrep '999999' dbglog.txt
real 7m11.365s

Well, I guess it's a little faster, but nothing to brag about.

Then I wanted to create a histogram of some findings, so I wrote a perl script to scan the file and create the histogram. Since it performed regular expression matching on every line, I assumed it would be a little slower than grep, since Perl is an interpreted language.

$ time ./count.pl dbglog.txt >count.out
real 3m9.427s

WOW! Less than half the time!

So I created a simple grep replacement: grep.pl. It doesn't do any histogramming, so it should be even faster.

$ time grep.pl '999999' dbglog.txt
real  2m8.341s

Amazing. Perl grep runs in less than a third the time of grep.

For small files, I bet Perl grep is slower starting up. Let's see.

$ time echo "hi" | grep 9999
real        0m0.051s

$ time echo "hi" | grep.pl 9999
real        0m0.113s

Yep. Grep saves you about 60 milliseconds. So if you had thousands of small files to grep, it might be faster to use grep.



UPDATE:

I got another big log file today (70 million lines) and saw something pretty surprising given my initial findings.


Sunday, June 16, 2019

Should everybody learn to code?

I saw something on SlashDot that raised the question: should all school children learn to code?

Yes.

For the same reason all school children should spend some time learning to sing, learning to do long division, learning to paint, learning some physics, learning some literature, learning to use a wrench and screwdriver, learning some history, learning some chemistry, etc, etc, etc. I believe that children who get a reasonably well-rounded, reasonable quality education grow up to be happier than those who don't, all else being equal. I don't have thousands of pages of peer-reviewed scientific research to support my belief, but I still believe it.

This does not mean that we should try to teach all children to be *good* programmers, any more than we should try to teach all children to be professional-level singers. We just need to introduce a wide range of subjects so that they have a basic understanding of what the subject is about. After that, let their natural talents and interest drive where they go in-depth.

I never knew that software was my calling until I had my first opportunity to try coding in ... was it 1975? This introduction was *not* taught in school. It was a group called "explorer scouts", which may or may not have been associated with Boy Scouts of America, I don't remember. All I can tell you is that there were no uniforms, no camping, none of the trappings of Boy Scouts. The only activity I can remember was the programming. It was Fortran on a time-sharing system with a teletype. The fire it ignited in my brain overwhelms any other memories of the Explorer Scouts.  THIS is what I wanted to do; my path in life was obvious.


HOW TO TEACH IT?

So, how should programming be taught?  I don't know. I know there has been a lot of research and development in the area of programming systems for young people. Drag-and-drop icons representing programming constructs. I've glanced at them, and even used one briefly (it was a google doodle). It's OK, I guess.  I think an important goal is to give a child early success and a feeling of accomplishment. I don't think those systems actually teach *programming*, but I think they probably do teach some fundamental concepts that are needed for programming.

My concern is that those systems don't really give a flavor of what programming is like. If you sing in music class, you get a sense of what singing is. If you play touch football, you get a sense of what that is. I actually consider myself fortunate that I wasn't introduced to programming that way. I suspect I would have thought that it was kind of neat, but I'm not sure it would have lit the fire. Part of what inspired me with my exposure was just the limitless possibilities. I'm not sure you get that with drag-and-drop programming.

The problem with most languages is the sheer amount of infrastructure you need to master before you can do things.  To write a simple Java program, you need to define a namespace and a class.  To print something, you need to enter System.out.println("something");. Reading a line from the user requires several lines of junk that would take hours to explain if you want the person to understand what the lines mean. Granted, you could just boilerplate it and tell the student to ignore all that stuff till later, but I think that reduces engagement. As does the edit/compile/run cycle. I don't think Java is a good first language, at least not for young children (and maybe nobody).

I have taught a few people to program. Want to know what worked for me?

"NO!  DON'T SAY IT!  PLEASE FOR THE LOVE OF ALL THAT IS DECENT, NO!"

Basic.

"GAAAAAAAAAAAAAAH %@!&*$!!!"


BASIC???

Yep. And I mean BASIC Basic. Line numbers. One or two character variable names. No "else".

10 print "hello"
20 goto 10
run

I find that humor is a good teaching tool.  If your first program is an infinite print loop, it's kind of funny. Not very funny, but a little. Also, you definitely need a REPL (Read-Eval-Print Loop), which Basic is. No edit/compile/run cycle please.

It takes me an afternoon to teach somebody to program. We don't get sophisticated, but by the end of the afternoon we do end up writing a simple "text adventure" style game.

You are facing two doors.  Do you want the right one or the left one?
? right
A lion has just eaten you!!!
Try again?
? yes
You are facing two doors.  Do you want the right one or the left one?
? left
You found a gold coin!
Do you want to keep going?  Or turn around?  (answer "going" or "turn")
? turn
You are facing two doors.  Do you want the right one or the left one?

etc. See? More humor. How many ways can you kill the player?  :-)

I feel that the Basic language didn't get in the way of seeing the simple algorithm. Any other language would have obscured the simple beauty of what we were doing.

[Update] Now mind you, this is a one-day exercise! For somebody with more interest, I would have a second lesson with subroutines and gross Basic-style input parameters (globals, actually), and for the third lesson, we would switch to a modern language. Maybe Python? Maybe Java? Maybe Lisp? It would depend on who it is and what they might do with it.

Anyway, I did this 1-day lesson with my daughter (she was ... I don't remember ... maybe 10?) and she did well.  Afterwards, she did ask me some questions and got a little help.  Later that year she gave me a birthday present. It was a CD ROM with her updated adventure program. It has maybe a dozen "rooms" and lots of jokes and little surprises in it.  She was VERY proud of herself. (And me of her; I still have the disk.)

But she pretty much lost interest. I am confident it was *not* because of the primitive language. It was just because it didn't light a fire in her. Which is fine.

Anyway, I really think an afternoon of Basic is a good way to introduce programming to a newbie of any age.  I *know* it works.

Wednesday, April 17, 2019

Black Hole Revealed!

I am in awe of the results of the Event Horizon Telescope team in their image of M87*, the supermassive black hole at the center of a distant galaxy!  I can't help but be especially excited at the role that computer scientists played in the analysis and reconstruction of the data collected by the radio telescopes to produce the image.

Radio astronomers have been doing long-baseline interferometry for a while now to produce images.  But the challenges of the Event Horizon Telescope were beyond what the earlier processing algorithms could make sense of.  The software team led by Katie Bouman developed the CHIRP algorithm that kind of blows my mind.  It warms my heart that women in science are finally getting some of the recognition they deserve.  (It also depresses me greatly that misogynist trolls are getting some press; geez, can't we just enjoy the accomplishment?)  Anyway, Katie did a TED Talk a few years ago that gives some excellent explanation about the algorithm.

If you want some understanding of why the image looks the way it does, I think that Derek Muller's Veritasium video does the best job that I've seen.  He also has a good follow-up video.

I also really appreciated astrophysicist Becky Smethurst's video that explains why the results are important (it's more than just further supporting Einstein's theory of gravitation).

Thursday, January 24, 2019

Volatile considered harmful

I happened on this today.  The article is narrowly-focused on Linux kernel work, but in my mind it helps to clarify a lot of "volatile" debate I've seen over the years.

https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html


I will note that when Corbet (the author) says, "the 'volatile' type class should not be used", what he really means is that you should not declare variables with volatile (or rather, almost never).  Corbet says, "the kernel primitives which make concurrent access to data safe ... If they are used properly, there will be no need to use volatile as well."  Some of those kernel primitives use volatile, but not in variable declarations.  Instead they use volatile in carefully-selected casts.

For example, another Corbet article describes another kernel primitive, "ACCESS_ONCE()".  It is defined as:

    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

The variable being accessed is temporarily cast to volatile to allow code to be written that violates threading assumptions made by the compiler's optimizer.  Like here, for example.  :-)
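To make that concrete, here is a small user-space sketch of the idea. It is not kernel code: stop_flag and spin_until_stopped() are hypothetical names, and I spell the GNU extension as __typeof__ so that it also compiles in strict ISO mode with GCC or Clang. Note that the cast only forces the compiler to reload the variable on each access; it does not provide atomicity or memory ordering.

```c
/* Same idea as the kernel macro quoted above, using the __typeof__
 * spelling of the GNU extension (an assumption for portability). */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical flag that another thread or a signal handler might set.
 * Note that the declaration itself is NOT volatile. */
int stop_flag = 0;

/* Spin until stop_flag becomes nonzero or max_spins is reached, and
 * return the number of spins. Without the volatile cast, the optimizer
 * would be free to read stop_flag once and hoist the load out of the loop. */
long spin_until_stopped(long max_spins)
{
    long spins = 0;
    while (!ACCESS_ONCE(stop_flag) && spins < max_spins) {
        spins++;
    }
    return spins;
}
```

The volatile semantics apply only at the one access that needs them; every other use of stop_flag is optimized normally.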

I only bring this up to point out that Corbet is not arguing that volatile should never be used at all.  Rather he is arguing that programmers (specifically kernel programmers) should not declare variables to be volatile.  Generally, programmers should use threading primitives to ensure correct code, and if the code's requirements prevent the use of the usual threading primitives, then lower-level primitives (like "ACCESS_ONCE()") should be used to precisely target volatile's use.

Tuesday, September 11, 2018

Safe sscanf() Usage

The scanf() family of functions (man page) is used to parse ASCII input into fields and convert those fields into desired data types.  I have very rarely used them, due to the nature of the kind of software I write (system-level, not application), and I have always viewed the function with suspicion.

Actually, scanf() is not safe.  Neither is fscanf().  But this is only because when an input conversion fails, it is not well-defined where the file pointer is left.  Fortunately, this is pretty easy to deal with - just use fgets() to read a line and then use sscanf().

You can use sscanf() safely, but you need to follow some rules.  For one thing, you should include field widths for pretty much all of the fields, not just strings.  More on the rules later.


Input Validity Checking with sscanf()

I am a great believer in doing a good job of checking validity of inputs (see strtol preferred over atoi).  It's not enough to process your input "safely" (i.e. preventing buffer overflows, etc.), you should also detect data errors and report them.  For example, if I want to input the value "21" and I accidentally type "2`", I want the program to complain and re-prompt, not simply take the value "2" and ignore the "`".

One problem with doing a good job of input validation is that it often leads to lots of checking code.  In strtol preferred over atoi, my preferred code consumes 4 lines, and even that is cheating (combining two statements on one line, putting close brace at end of line).  And that's for a single integer!  What if you have a line with multiple pieces of information on it?

Let's say I have input lines with 3 pipe-separated fields:

  id_num|name|age

where "id_num" and "age" are 32-bit unsigned integers and "name" is alphabetic with a few non-alphas.  We'll further assume that a name must not contain a pipe character.

Bottom line: here is my preferred code:

#define NAME_MAX_LEN 60
. . .
unsigned int id;
char name[NAME_MAX_LEN+1];  /* allow for null */
unsigned int age;
int null_ofs = 0;

(void)sscanf(iline,
    "%9u|"  /* id */
    "%" STR(NAME_MAX_LEN) "[-A-Za-z,. ]|"  /* name */
    "%3u\n"  /* age */
    "%n",  /* null_ofs */
    &id, name, &age, &null_ofs);

if (null_ofs == 0 || iline[null_ofs] != '\0') {
    printf("PARSE ERROR!\n");
}


Error Checking

My long-time readers (all one of you!) will take issue with me throwing away the sscanf() return value.  However, it turns out that sscanf()'s return value really isn't as useful as it should be.  It tells you how many successful conversions were done, but doesn't tell you if there was extra garbage at the end of the line.  A better way to detect success is to add the "%n" at the end to get the string offset where conversion stopped, and make sure it is at the end of the string.

If any of the conversions fail, the "%n" will not be assigned, leaving it at 0, which is interpreted as an error.  Or, if all conversions succeed, but there is garbage at the end, "iline[null_ofs]" will not be pointing at the string's null, which is also interpreted as an error.
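Pulling the pieces together, here is the same parse wrapped in a function so that the success and failure cases are easy to try. The struct person type and the parse_person() name are hypothetical, made up for this sketch; the format string and the "%n" check are the ones shown above.

```c
#include <stdio.h>

#define STR2(_s) #_s
#define STR(_s) STR2(_s)
#define NAME_MAX_LEN 60

/* Hypothetical record type for one parsed input line. */
struct person {
    unsigned int id;
    char name[NAME_MAX_LEN + 1];  /* allow for null */
    unsigned int age;
};

/* Parse "id_num|name|age". Return 0 on success, -1 on any parse error. */
int parse_person(const char *iline, struct person *p)
{
    int null_ofs = 0;

    (void)sscanf(iline,
        "%9u|"  /* id */
        "%" STR(NAME_MAX_LEN) "[-A-Za-z,. ]|"  /* name */
        "%3u\n"  /* age */
        "%n",  /* null_ofs */
        &p->id, p->name, &p->age, &null_ofs);

    /* null_ofs is still 0 if a conversion failed; otherwise it must
     * index the terminating null, or there is trailing garbage. */
    if (null_ofs == 0 || iline[null_ofs] != '\0') {
        return -1;
    }
    return 0;
}
```

For example, "123|Ford, S.|55\n" parses cleanly, while "123|Ford|2`\n" is rejected: the age converts to 2, but null_ofs is left pointing at the backquote instead of the string's null.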


STR Macro

More interesting is the "STR(NAME_MAX_LEN)" construct.  This is a specialized macro I learned about on StackOverflow:

#define STR2(_s) #_s
#define STR(_s) STR2(_s)

Yes, both macros are needed.  So the construct:
  "%" STR(NAME_MAX_LEN) "[-A-Za-z,. ]"
gets compiled to the string:
  "%60[-A-Za-z,. ]"

So why use that macro?  If I had hard-coded the format string as "%60[-A-Za-z,. ]", it would risk getting out of sync with the actual size of "name" if we find somebody with a 65 character name.  (I would like it even more if I could use "sizeof(name)", but macro expansion and "sizeof()" are handled by different compiler phases.)
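If you are wondering why both macros are needed, this small sketch shows the difference. The variable names are made up for illustration, and the compile-time checks assume a C11 compiler (for _Static_assert).

```c
#define STR2(_s) #_s
#define STR(_s) STR2(_s)
#define NAME_MAX_LEN 60

/* One level only: '#' stringifies the argument as written, before it
 * is macro-expanded, so this is the string "NAME_MAX_LEN". */
const char *one_level = STR2(NAME_MAX_LEN);

/* Two levels: the argument is expanded to 60 on the way through STR(),
 * and only then stringified by STR2(), so this is the string "60". */
const char *two_level = STR(NAME_MAX_LEN);

/* Adjacent string literals are concatenated by the compiler. */
const char *name_fmt = "%" STR(NAME_MAX_LEN) "[-A-Za-z,. ]";

/* Compile-time checks of the claims above (sizeof counts the null). */
_Static_assert(sizeof(STR2(NAME_MAX_LEN)) == 13, "\"NAME_MAX_LEN\" plus null");
_Static_assert(sizeof(STR(NAME_MAX_LEN)) == 3, "\"60\" plus null");
```

One caveat with this trick: NAME_MAX_LEN has to be defined as a plain decimal literal. Something like (50+10) would be stringified as written and produce an invalid format string.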


Newline Specifier

This one is a little subtle.  The third conversion specifier is "%3u\n".  This tells sscanf() to match newline at the end of the line.  But guess what!  If I don't include a newline on the string, the sscanf() call still succeeds, including setting null_ofs to the input string's null.  In my opinion, this is a slight imperfection in sscanf(): I said I needed a newline, but sscanf() accepts my input without the newline.  The reason is that sscanf() treats any whitespace character in the format, newline included, as a directive to skip zero or more whitespace characters in the input.

If I *really* want to require the newline, I can do:

if (null_ofs == 0 || iline[null_ofs - 1] != '\n') { printf("PARSE ERROR!\n"); }


Integer Field Widths

Finally, I add field widths to the unsigned integer conversions.  Why?  If you leave it out, then it will do a fine job of converting numbers from 0..4294967295.  But try the next number, 4294967296, and what will you get?  Arithmetic overflow, with a result of 0 (although the C standards do not guarantee that value).  And sscanf() will not return any error.  So I restrict the number of digits to prevent overflow.  The disadvantage is that with "%9u" you cannot input the full range of an unsigned integer.  An alternative is to use "%10llu" and supply an unsigned long long variable.  Then range check it in your code.  Or you could just use "%8x" and have the user input hexadecimal.  Heck, you could even input it as a 10-character string and use strtol()!

I guess "%9u" was good enough for my needs.  I just wish you could just tell sscanf() to stop converting on arithmetic overflow!
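For what it's worth, here is a sketch of the "%10llu" alternative with the range check. The parse_u32() name is hypothetical, and it assumes (as the rest of this post does) that unsigned int is 32 bits.

```c
#include <limits.h>
#include <stdio.h>

/* Convert up to 10 decimal digits into a wider type, then range-check.
 * Return 0 and set *out on success, -1 on any error. */
int parse_u32(const char *s, unsigned int *out)
{
    unsigned long long wide = 0;
    int ofs = 0;

    (void)sscanf(s, "%10llu%n", &wide, &ofs);
    if (ofs == 0 || s[ofs] != '\0') {
        return -1;  /* nothing converted, or trailing garbage (e.g. an 11th digit) */
    }
    if (wide > UINT_MAX) {
        return -1;  /* converted fine, but does not fit in an unsigned int */
    }
    *out = (unsigned int)wide;
    return 0;
}
```

Ten decimal digits can never overflow a 64-bit unsigned long long, so the conversion itself is safe, and the comparison against UINT_MAX catches 4294967296 and up.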


Not Perfect

I already mentioned arithmetic overflow. Any other problems with the above code?  Well, yeah, I guess.  Even though I specify "id" and "age" to be unsigned integers, I notice that sscanf() will allow negative numbers to be inputted.  Sometimes we engineers consider this a feature, not a bug; we like to use 0xFFFFFFFF as a flag value, and it's easier to type "-1".  But I admit it still bothers me.  If I input an age of -1, we end up with the 4-billion-year-old man (apologies to Mel Brooks).
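If the negative numbers bother you enough, one possible workaround is a quick pre-check before calling sscanf(). This is only a sketch: numeric_fields_unsigned() is a hypothetical helper, and it relies on the earlier assumption that a name cannot contain a pipe character, so the last pipe on the line always precedes the age.

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if the "id_num" and "age" fields of "id_num|name|age" both
 * start with a decimal digit (no sign, no leading whitespace), else 0. */
int numeric_fields_unsigned(const char *iline)
{
    const char *last_pipe = strrchr(iline, '|');

    if (!isdigit((unsigned char)iline[0])) {
        return 0;  /* id_num is empty or starts with '-', '+', or whitespace */
    }
    if (last_pipe == NULL || !isdigit((unsigned char)last_pipe[1])) {
        return 0;  /* no age field, or it does not start with a digit */
    }
    return 1;
}
```

The check is stricter than sscanf() in one more way: it also rejects leading whitespace and a leading "+", both of which "%u" would quietly accept.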


Monday, May 22, 2017

Some multicast programming tips

Never too old to learn.  :-)

There are lots of multicast example programs out there, so I won't try to compete with them.  But I did run across several things that weren't explained very well.


Single Socket, Multiple Groups

Yes, you can create a single socket and have it receive datagrams from multiple multicast groups.  Just include multiple calls to:
  setsockopt(recv_sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, ...
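Spelled out a little more, the join loop might look like this sketch. The make_mreq() and join_groups() names are hypothetical, the join is done on the default interface (INADDR_ANY), and it assumes a Linux-style sockets API with struct ip_mreq.

```c
#define _DEFAULT_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fill in a membership request for "group" on the default interface.
 * Return 0 on success, -1 if "group" is not an IPv4 multicast address. */
int make_mreq(const char *group, struct ip_mreq *mreq)
{
    memset(mreq, 0, sizeof(*mreq));
    if (inet_pton(AF_INET, group, &mreq->imr_multiaddr) != 1) {
        return -1;
    }
    if (!IN_MULTICAST(ntohl(mreq->imr_multiaddr.s_addr))) {
        return -1;
    }
    mreq->imr_interface.s_addr = htonl(INADDR_ANY);
    return 0;
}

/* Join every group in "groups" on one already-created UDP socket.
 * Return the number of groups joined, or -1 on the first failure. */
int join_groups(int recv_sock, const char *const groups[], int num_groups)
{
    for (int i = 0; i < num_groups; i++) {
        struct ip_mreq mreq;
        if (make_mreq(groups[i], &mreq) != 0) {
            return -1;
        }
        if (setsockopt(recv_sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                       &mreq, sizeof(mreq)) != 0) {
            return -1;
        }
    }
    return num_groups;
}
```

The socket still needs its own bind() call, which is covered in the following sections.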


Multiple Sockets, One Group per Socket

This is another common use case, where you create multiple sockets for receiving, with each socket joined to a different multicast group.


Binding the Receive Socket

Since a socket needs to be bound to a port to receive any kind of UDP datagram, multicast or unicast, you need to include a call to bind().  You pass in a sockaddr_in with the sin_port set as desired (remember to pass it in network order).  But what about the sin_addr?  What do you set that to?

Many people set it to INADDR_ANY, which is what I did in a recent program.  But in the multiple sockets, different group per socket case, it had an unexpected side effect.  All of my sockets were bound to the same destination port, but joined to different multicast groups.  With sin_addr set to INADDR_ANY, the kernel took each received datagram, replicated it, and delivered a copy to *every* socket, even if the datagram's destination group is different from the one joined to the socket! I.e. simply doing the IP_ADD_MEMBERSHIP on a socket didn't filter datagrams based on the desired group.  When a multicast datagram was received, the kernel just used the destination port and delivered a copy to every UDP socket bound to that port and INADDR_ANY.

I had to do some extra searching to find out that you can set the bind's sin_addr to the multicast group.  I have some reason to suspect that this is not portable across all operating systems, but at least it works on Linux.  Now I can have 10 sockets, each bound to the same port (don't forget SO_REUSEADDR) but different multicast groups.  When a multicast datagram is received, it is delivered *only* to the socket which is bound to the right port/multicast group pair.
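
Here is a minimal sketch of the bind-to-the-group approach, assuming Linux and IPv4.  The function names make_group_addr() and open_group_socket() are made up for illustration, and the join uses INADDR_ANY for the interface (i.e. the kernel picks the default multicast interface), which may not be what you want on a multi-homed host.

```c
#define _DEFAULT_SOURCE  /* for struct ip_mreq under strict -std= modes */
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill in a sockaddr_in for a multicast group and port.
 * Returns 0 on success, -1 if "group" is not an IPv4 multicast address. */
int make_group_addr(const char *group, uint16_t port, struct sockaddr_in *out)
{
  assert(group != NULL && out != NULL);
  memset(out, 0, sizeof(*out));
  out->sin_family = AF_INET;
  out->sin_port = htons(port);  /* network order */
  if (inet_pton(AF_INET, group, &out->sin_addr) != 1) return -1;
  if (!IN_MULTICAST(ntohl(out->sin_addr.s_addr))) return -1;
  return 0;
}

/* Create a UDP socket bound to group:port (not INADDR_ANY) and joined
 * to that group.  Returns the socket, or -1 on error. */
int open_group_socket(const char *group, uint16_t port)
{
  struct sockaddr_in addr;
  struct ip_mreq mreq;
  int on = 1;
  int sock;

  if (make_group_addr(group, port, &addr) != 0) return -1;

  sock = socket(AF_INET, SOCK_DGRAM, 0);
  if (sock < 0) return -1;

  /* Needed so that several sockets can share the same port. */
  if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) != 0) goto fail;

  /* Binding to the group address is what makes the kernel filter by group. */
  if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0) goto fail;

  memset(&mreq, 0, sizeof(mreq));
  mreq.imr_multiaddr = addr.sin_addr;
  mreq.imr_interface.s_addr = htonl(INADDR_ANY);  /* assumption: default interface */
  if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) != 0) goto fail;

  return sock;

fail:
  close(sock);
  return -1;
}
```

Calling open_group_socket() once per group, all with the same port, gives one socket per group, each seeing only its own group's datagrams.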


Single Socket, Multiple Groups, reprise

So, what about the case where you have a single socket joined to multiple groups?  In that case, you *do* want to use INADDR_ANY in the bind.
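
A sketch of that single-socket case, again assuming Linux and IPv4.  The names fill_mreq() and open_multi_group_socket() are hypothetical, and the joins use the default interface.

```c
#define _DEFAULT_SOURCE  /* for struct ip_mreq under strict -std= modes */
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Build the join request for one group.
 * Returns 0, or -1 if "group" is not an IPv4 multicast address. */
int fill_mreq(const char *group, struct ip_mreq *mreq)
{
  assert(group != NULL && mreq != NULL);
  memset(mreq, 0, sizeof(*mreq));
  if (inet_pton(AF_INET, group, &mreq->imr_multiaddr) != 1) return -1;
  if (!IN_MULTICAST(ntohl(mreq->imr_multiaddr.s_addr))) return -1;
  mreq->imr_interface.s_addr = htonl(INADDR_ANY);  /* assumption: default interface */
  return 0;
}

/* Create one UDP socket bound to INADDR_ANY:port and join it to every
 * group in the list.  Returns the socket, or -1 on error. */
int open_multi_group_socket(const char *const groups[], size_t num_groups, uint16_t port)
{
  struct sockaddr_in addr;
  struct ip_mreq mreq;
  size_t i;
  int sock;

  assert(groups != NULL || num_groups == 0);
  sock = socket(AF_INET, SOCK_DGRAM, 0);
  if (sock < 0) return -1;

  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* accept any joined group */
  if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0) goto fail;

  /* One IP_ADD_MEMBERSHIP call per group, all on the same socket. */
  for (i = 0; i < num_groups; i++) {
    if (fill_mreq(groups[i], &mreq) != 0) goto fail;
    if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) != 0) goto fail;
  }
  return sock;

fail:
  close(sock);
  return -1;
}
```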


Mix and Match?

I guess this poses a restriction.  You can't have, say, 2 sockets that you distribute 4 multicast groups across, with two groups each.  Why would you want to do that?  Maybe to load-balance across threads.  But assuming they all want to bind to the same port, you can't do it.  Setting the sin_addr to INADDR_ANY prevents filtering, and will mean that both sockets will receive a copy of every datagram sent.  But you can't set sin_addr to multiple multicast groups.

So if you want to have multiple sockets, multiple groups, and the same destination port, you need to have one group per socket, and bind that socket to the group.

Monday, May 15, 2017

WannaCrypt / WannaCry ransomware

I'm not a security researcher, and I don't follow the subject very closely.  But here is an interesting read by the person who slowed the spread of the recent WannaCrypt / WannaCry ransomware outbreak.

https://www.malwaretech.com/2017/05/how-to-accidentally-stop-a-global-cyber-attacks.html

Wednesday, July 8, 2015

UTF-8, Unicode, and International Character Sets (Oh My!)

I'm a little embarrassed to admit that in my entire 35-year career, I've never really learned anything about international character sets (Unicode, etc).  And I still know very little.  But recently I had to learn enough to at least be able to talk somewhat intelligently, so I thought I would speed up somebody else's learning a bit.

A more in-depth page that I like is: http://www.cprogramming.com/tutorial/unicode.html

Unicode - a standard which, among other things, specifies a unique code point (numeric value) to correspond with a printed glyph (character form).  The standard defines a code point space from 0 to 1,114,111 (0 to 0x10FFFF), but as of June 2015 it only assigns 120,737 of those code points to actual glyphs.

UTF-8 - a standard which is basically a means to encode a stream of numeric values.  As originally designed, it covered a range from 0 to 2,147,483,647 (0 to 0x7FFFFFFF), using a variable number of bytes (1-6) to represent each value such that numerically smaller values are represented with fewer bytes.  For example, the number 17 requires a single byte to represent, whereas the number 2,000,000,000 requires 6 bytes.  Notice that the largest Unicode code point only requires 4 bytes to represent, and the current definition of UTF-8 (RFC 3629) limits the encoding to the Unicode range, so sequences longer than 4 bytes are no longer valid.
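
To make the variable-length idea concrete, here is a small sketch of an encoder.  The name utf8_encode() is made up for illustration, and the sketch only produces the 1- to 4-byte forms that cover the Unicode range (which is what RFC 3629 allows), not the older 5- and 6-byte forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[] (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (above 0x10FFFF, or in the UTF-16 surrogate range). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
  assert(out != NULL);
  if (cp <= 0x7F) {            /* 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {           /* 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {          /* 1110xxxx 10xxxxxx 10xxxxxx */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not characters */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

For example, utf8_encode(17, buf) writes the single byte 0x11, and utf8_encode(0x10FFFF, buf) writes the four bytes 0xF4, 0x8F, 0xBF, 0xBF.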


So the Unicode standard specifies that a particular numeric value corresponds to a specific printed character form, and UTF-8 specifies a way to encode those numeric values.  Although the UTF-8 encoding scheme could theoretically be used to encode numbers for any purpose, it was designed to encode Unicode characters reasonably efficiently.  The efficiency derives from the fact that Unicode biases the most-frequently used characters to smaller numeric values.

UTF-8 and Unicode were designed to be backward compatible with 7-bit ASCII, in that the ASCII characters have numeric values equal to the first 128 Unicode code points (0-127), and UTF-8 can represent those values with a single byte.  Thus the string "ABC" is represented in ASCII as 3 bytes with values 0x41, 0x42, 0x43, and those same three bytes are the valid UTF-8 encoding of the same string in Unicode.  Thus, an application designed to read UTF-8 input is able to read plain ASCII text.  And an application which only understands 7-bit ASCII can read a UTF-8 file which restricts itself to the first 128 Unicode code points.

Another nice thing about UTF-8 is that the bytes of a multi-byte character cannot be confused with normal ASCII characters; every byte of a multi-byte character has the most-significant bit set.  For example, the tilde-n character "ñ" has a Unicode code point 241, and the UTF-8 encoding of 241 is the two-byte sequence 0xC3, 0xB1.  It is also easy to differentiate the first byte of a multi-byte character from subsequent bytes.  Thus, if you pick a random byte in a UTF-8 buffer, it is easy to detect whether the byte is part of a multi-byte character, and easy to find that character's first byte, or to move past it to the next character.
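
A sketch of how that looks in code.  The function names are hypothetical, and the code assumes the buffer already holds valid UTF-8.  The trick is that subsequent ("continuation") bytes always have the bit pattern 10xxxxxx, and no first byte ever does.

```c
#include <assert.h>
#include <stddef.h>

/* True if b is a continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_cont(unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Back up from byte offset pos to the first byte of the character
 * that contains it. */
size_t utf8_char_start(const unsigned char *buf, size_t pos)
{
  assert(buf != NULL);
  while (pos > 0 && utf8_is_cont(buf[pos])) pos--;
  return pos;
}

/* Return the offset of the first byte of the character following the
 * one at pos, or len if there is none. */
size_t utf8_next_char(const unsigned char *buf, size_t len, size_t pos)
{
  assert(buf != NULL);
  if (pos < len) pos++;
  while (pos < len && utf8_is_cont(buf[pos])) pos++;
  return pos;
}

/* Count characters (code points) by counting non-continuation bytes. */
size_t utf8_count(const unsigned char *buf, size_t len)
{
  size_t i, n = 0;
  assert(buf != NULL || len == 0);
  for (i = 0; i < len; i++) {
    if (!utf8_is_cont(buf[i])) n++;
  }
  return n;
}
```

With the 4-byte buffer 0x61, 0xC3, 0xB1, 0x62 ("añb"), utf8_char_start(buf, 2) backs up to offset 1, and utf8_count(buf, 4) is 3.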

One thing to notice about UTF-8 is that it is trivially easy to contrive input which is illegal and will not properly parse.  For example, the byte sequence 0xC3, 0x41 is illegal in UTF-8 (0xC3 introduces a 2-byte character, and all bytes in a multi-byte character *must* have the most-significant bit set).
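
A sketch of a validity check, assuming the RFC 3629 rules (at most 4 bytes per character, no overlong encodings, no surrogates, nothing above 0x10FFFF).  The name utf8_is_valid() is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if buf[0..len-1] is well-formed UTF-8, else 0. */
int utf8_is_valid(const unsigned char *buf, size_t len)
{
  size_t i = 0;

  assert(buf != NULL || len == 0);
  while (i < len) {
    unsigned char b = buf[i];
    size_t extra;   /* number of continuation bytes expected */
    size_t k;
    uint32_t cp;    /* decoded code point */
    uint32_t min;   /* smallest value allowed for this length */

    if (b < 0x80) { i++; continue; }  /* plain ASCII */
    else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
    else return 0;  /* stray continuation byte or invalid first byte */

    if (len - i - 1 < extra) return 0;  /* character is cut off */

    for (k = 1; k <= extra; k++) {
      if ((buf[i + k] & 0xC0) != 0x80) return 0;  /* e.g. 0xC3, 0x41 */
      cp = (cp << 6) | (uint32_t)(buf[i + k] & 0x3F);
    }
    if (cp < min) return 0;                       /* overlong encoding */
    if (cp > 0x10FFFF) return 0;                  /* beyond Unicode */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate */

    i += extra + 1;
  }
  return 1;
}
```

The sequence 0xC3, 0xB1 passes, while 0xC3, 0x41 fails at the continuation-byte check.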


Other Encoding Schemes

There are other Unicode encoding schemes, such as UCS-2 and UTF-16, but their usage is declining.  UCS-2 cannot represent all characters of the current Unicode standard, and UTF-16 suffers from problems of ambiguous endianness (byte ordering).  Neither is backward compatible with ASCII text.

Another common non-Unicode-based encoding scheme is ISO-8859-1.  Its advantage is that all characters are represented in a single byte.  Its disadvantage is that it only covers a small fraction of the world's languages.  It is backward compatible with ASCII text, but it is *not* compatible with UTF-8.  For example, the byte sequence 0xC3, 0x41 is a perfectly valid ISO-8859-1 sequence ("ÃA") but is illegal in UTF-8.  According to Wikipedia, ISO-8859-1 usage has been declining since 2006 while UTF-8 has been increasing.

There are a bunch of other encoding schemes, most of which are variations on the ISO-8859-1 standard, but they represent a small installed base and are not growing nearly as fast as UTF-8.

Unfortunately, there is not a reliable way to detect the encoding scheme being used simply by examining the input.  The input data either needs to be paired with metadata which identifies the encoding (like a MIME header), or the user simply has to know what they have and what their software expects.

Most Unixes have available a program named "iconv" which will convert files of pretty much any encoding scheme to any other.  The user is responsible for telling "iconv" what format the input file has.


Programming with Unicode Data

Java and C# have significant features which allow them to process Unicode strings fairly well, but not perfectly.  The Java "char" type is 16 bits, which at the time Java was being defined, was adequate to hold Unicode.  But Unicode evolved to cover more writing systems and 16 bits is no longer adequate, so Java now supports UTF-16 which encodes a Unicode character in either 1 or 2 of those 16-bit chars.  Not being much of a Java or C# programmer, I can't say much more about them.

In C, Unicode is not handled natively at all.

A programmer needs to decide an encoding scheme for in-memory storage of text.  One disadvantage to using something like UTF-8 is that the number of bytes per character varies, making it difficult to randomly access characters by their offset.  If you want the 600th character, you have to start at the beginning and parse your way forward.  Thus, random access is O(n) time instead of O(constant) time that usually accompanies arrays.

One approach that evolved a while ago was the use of a "wide character", with the type "wchar_t".  This would allow you to declare an array of "wchar_t" and be able to randomly access it in O(constant) time.  In earlier days, it was thought that Unicode could live within the range 0-65535, so the original "wchar_t" was 16 bits.  Some compilers still have "wchar_t" as 16 bits (most notably Microsoft Visual Studio).  Other compilers have "wchar_t" as 32 bits (most notably gcc), which makes them a candidate for use with full Unicode.

Most recent advice I've seen tells programmers to avoid the use of "wchar_t" due to its portability problems and instead use a fixed 32-bit type, like "uint32_t", which sadly did not exist in Windows until Visual Studio 2010, so you still need annoying conditional compiles to make your code truly portable.  Also, an advantage of wchar_t over uint32_t is the availability of wide flavors of many standard C string handling functions (standardized in C99).

Other opinionated programmers have advised against the use of wide characters altogether, claiming that constant time lookup is vastly overrated since most text processing software spends most of its time stepping through text one character at a time.  The use of UTF-8 allows easy movement across multi-byte characters just by looking at the upper bits of each byte.  Also, the library libiconv provides an API to do the conversions that the "iconv" command does.
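
For what it's worth, here is a sketch of the iconv API converting ISO-8859-1 to UTF-8.  The wrapper name latin1_to_utf8() is made up for illustration.  On Linux with glibc, iconv is part of the C library; on other systems you may need to link with -liconv.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Convert in_len bytes of ISO-8859-1 text to UTF-8.
 * Returns the number of bytes written to out, or (size_t)-1 on error
 * (for example, if out is too small). */
size_t latin1_to_utf8(const char *in, size_t in_len, char *out, size_t out_size)
{
  iconv_t cd;
  char *in_ptr = (char *)in;  /* iconv() wants char **, but only reads the input */
  char *out_ptr = out;
  size_t in_left = in_len;
  size_t out_left = out_size;
  size_t rc;

  assert(in != NULL && out != NULL);

  cd = iconv_open("UTF-8", "ISO-8859-1");  /* note the order: (to, from) */
  if (cd == (iconv_t)-1) return (size_t)-1;

  /* iconv() advances the pointers and decrements the byte counts. */
  rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
  iconv_close(cd);
  if (rc == (size_t)-1) return (size_t)-1;

  return out_size - out_left;
}
```

Given the 3 ISO-8859-1 bytes 0x61, 0xF1, 0x6F ("año"), this should produce the 4 UTF-8 bytes 0x61, 0xC3, 0xB1, 0x6F.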

And yet, I can understand the attraction of wide (32-bit) characters.  Imagine I have a large code base which does string manipulation.  The hope is that by changing the type from char to wchar_t (or uint32_t), my for loops and comparisons will "just work".  However, I've seen tons of code which assumes that a character is 1 byte (e.g. it will malloc only the number of characters without multiplying by the sizeof the type), so the chances seem small of any significant code "just working" after changing char to wchar_t or uint32_t.
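
To illustrate the malloc problem, here is a sketch of a string duplicator for wide characters.  The name wide_dup() is hypothetical (POSIX already has wcsdup(), which does the same job).

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Return a malloc'ed copy of a wide string, or NULL if out of memory.
 * The caller frees the result. */
wchar_t *wide_dup(const wchar_t *src)
{
  size_t num_chars;
  wchar_t *copy;

  assert(src != NULL);
  num_chars = wcslen(src) + 1;  /* characters, including the terminator */

  /* Code converted from char often does malloc(num_chars) here, which
   * allocates bytes, not characters, and is too small for wchar_t. */
  copy = malloc(num_chars * sizeof(*copy));
  if (copy == NULL) return NULL;

  wmemcpy(copy, src, num_chars);
  return copy;
}
```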

Finally, note that UTF-8 is compatible with many standard C string functions because null can be safely used to indicate the end of string (some other encoding schemes can have null bytes sprinkled throughout the text).  However, note that the function strchr() is *not* generally UTF-8 compatible since it assumes that every character is a char.  But the function strstr() *is* compatible with UTF-8 (so long as *both* string parameters are encoded with UTF-8).
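
A small sketch of the strstr() point.  The wrapper name utf8_find() is made up for illustration, and it returns a *byte* offset, not a character index.  It works because a valid UTF-8 needle starts with a non-continuation byte, so it can never match starting in the middle of a character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find a UTF-8 encoded needle (one character or several) in a UTF-8
 * haystack.  Returns the byte offset of the match, or -1 if not found. */
long utf8_find(const char *haystack, const char *needle)
{
  const char *hit;

  assert(haystack != NULL && needle != NULL);
  hit = strstr(haystack, needle);
  return (hit == NULL) ? -1 : (long)(hit - haystack);
}
```

For example, utf8_find("jalape\xC3\xB1o", "\xC3\xB1") finds "ñ" at byte offset 6.  By contrast, strchr("jalape\xC3\xB1o", 0xF1) returns NULL, since no single byte in the string has the value of the code point.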


Bottom Line: No Free (or even cheap) Lunch

Unfortunately, there is no easy path.  If I am ever tasked with writing international software, I suspect I will bite the bullet and choose UTF-8 as my internal format.