Tuesday, September 23, 2014

Tail Call Recursion in Clojure


In this article, I explain tail call recursion, show an example in both Clojure and common lisp, opine that Clojure's approach to tail call recursion optimization is better than lisp's, and finally look at Clojure's "lazy-seq" which seems to mysteriously provide the benefits of tail call recursion optimization for algorithms which are *not* tail call recursive.

Note: in this article, I refer to "tail call recursion", but I am actually dealing with a subset of that domain: tail *self* call (a.k.a. "direct") recursion. The theories presented also apply to the general case, including mutual (a.k.a. "indirect") recursion. However, optimization of tail *mutual* call recursion is much more rare in compilers than optimization of tail *self* call recursion. So I'm only considering self call.


What Is Tail Call Recursion?

Tail call recursion is a special case of recursion. With recursion, a function calls itself. *Tail* call recursion is where the function calls itself, and then immediately returns, either returning nothing, or returning the value of the recursive call. I.e. it doesn't call itself, and then do something with the results recursive call's value before returning. The reason tail call recursion is an interesting special case is that it can be optimized in a way that doesn't require a stack frame for each recursive invocation (i.e. it converts it into a simple loop). This makes it faster and conserves stack space.

Take the most-obvious way to program factorial recursively in clojure:

(defn fact1 [x]
  (if (<= x 1)
    1N
    (* x (fact1 (dec x)))))

(fact1 5) ;= 120N
(fact1 9000) ;= StackOverflowError

This is not tail call recursive. If you call (fact 5), that initial invocation needs to wait for the recursive invocation to return the value of 4! so that it can multiply it by 5 and return the result. It needs to maintain the state of x across the recursive invocation, and it uses the stack for that. Clojure's stack seems awfully small to me; less than 9000 levels deep.

It's easy to modify fact1 to be tail recursive. It requires a modified version in which you pass a "result so far" as an additional input parameter:

(defn fact2-tail [x result-so-far]
  (if (<= x 1)
    result-so-far
    (fact2-tail (dec x) (* x result-so-far))))

(fact2-tail 5 1N) ;= 120N
(fact2-tail 9000 1N) ;= StackOverflowError

(If you don't like having to pass the "1N", you can simply define a wrapper function.) This is tail call recursive because all the work, including the multiplication, is done before the recursive call.  So theoretically no state needs to be stacked up as the recursion goes deeper.  But as you can see by the stack overflow, Clojure doesn't realize this.

Be aware that some lisps will do automatic tail call optimization (or at least tail self-call optimization). I tried the equiv thing with clisp and it worked fine with big values:

(compile
 (defun fact2-tail-lisp (x result-so-far)
   (if (<= x 1)
     result-so-far
     (fact2-tail-lisp (1- x) (* x result-so-far)))))

(fact2-tail-lisp 59999 1) ;= 260,630-digit number starting with 2606

So what's a Clojure programmer to do? Use the loop/recur construct. Let's convert fact1, which is NOT tail call recursive, to loop/recur:

(defn fact1-loop [x]
  (loop [i x]
    (if (<= i 1)
      1N
      (* i (recur (dec i))))))
CompilerException java.lang.UnsupportedOperationException: Can only recur from tail position

Interesting. Clojure is NOT smart enough to recognize tail call recursion in fact2-tail, but it is smart enough to recognize NON-tail call recursion in fact1_loop. So the Clojure programmer must both recode the algorithm to be tail call recursive and code it to use loop/recur:

(defn fact2-loop [x]
  (loop [i x, result-so-far 1N]
    (if (<= i 1)
      result-so-far
      (recur (dec i) (* i result-so-far)))))

(fact2-loop 5) ;= 120N
(fact2-loop 59999) ;= 260,630-digit number starting with 2606

There! One advantage to using loop/recur is you don't need a separate function which takes the extra result-so-far parameter. It's kind of built in to the loop construct.


Which Is Better?

So, which is better: Clisp, which automatically did the optimization, or Clojure, where the programmer basically has to *tell* the compiler to optimize it with loop/recur?

I have to side with Clojure, and I'll tell you why. I started my experimentations knowing that lisp did the optimization automatically. I coded it up, and it overflowed the stack. Huh? I had to do a bunch of googling to discover this page which mentions in footnote 1 that you have to compile the code to get it to do the optimization. Had I not tried it with a large number, or if I had mis-coded the algorithm to not really be tail call recursive, then I would have put in some effort to achieve an intended benefit, would not have received that benefit, and I might not have known it.

Most of the time, you need to carefully code your algorithm to be tail call recursive. I prefer Clojure's approach where you tell the compiler that you intend it to perform the optimization, and it gives you an error if it can't, as opposed to lisp which silently leaves it un-optimized if you make a mistake.  It also gives a hint to a future maintainer that your potentially awkward algorithm is coded that way for a reason (admit it: fact2-tail and fact2-loop are more awkward algorithms than fact1).


What About lazy-seq?

Now I'm experimenting with lazy-seq, which can also be used to avoid stack consumption. It's a different animal -- the factorial function only returns a single value, whereas lazy-seq is intended to assemble a list (or more-accurately, a lazy sequence).

Let's start with re-writing fact1 to return a list of all factorials, from 1 to the requested value:

(defn fact3 [x i result-so-far]
  (if (> i x)
    '()
    (cons
     (* i result-so-far)
     (fact3 x (inc i) (* i result-so-far)))))

(fact3 5 1N 1N) ;= (1N 2N 6N 120N)
(last (fact3 5 1N 1N)) ;= 120N
(last (fact3 9000 1N 1N)) ;= StackOverflowError

This function is recursive but doesn't use loop/recur. No wonder it blew up the stack. To convert this to loop/recur requires a change in the output.  The final return value has to be the entire list, so the "cons"ing needs to be done on-the-fly each time through the loop. Thus, larger factorial values are "cons"ed onto a list-so-far of smaller factorial values, resulting in a reversed list:

(defn fact3-loop [x]
  (loop [i 2N, result-so-far '(1N)]
    (if (> i x)
      result-so-far
      (recur (inc i) (cons (* i (first result-so-far)) result-so-far)))))

(fact3-loop 5) ;= (120N 24N 6N 2N 1N)
(first (fact3-loop 5) ;= 120N
(first (fact3-loop 9000)) ;= 31,682-digit number starting with 8099
(first (fact3-loop 59999)) ;= OutOfMemoryError  Java heap space

We solved the stack problem, but ran out of heap assembling the massive list (note that most of the elements of the list hold VERY large numbers, which consumes significant space for each list element).

lazy-seq to the rescue! Here's it is in action:

(defn fact3-lazy [x i result-so-far]
  (if (> i x)
    '()
    (cons
     (* i result-so-far)
     (lazy-seq (fact3-lazy x (inc i) (* i result-so-far))))))

(fact3-lazy 5 1N 1N) ;= (1N 2N 6N 24N 120N)
(last (fact3-lazy 5 1N 1N) ;= 120N
(last (fact3-lazy 9000 1N 1N) ;= 31,682-digit number starting with 8099
(last (fact3-lazy 59999 1N 1N) ;= 260,630-digit number starting with 2606

Look at the algorithm; it's almost identical to fact3. I.e. it is *not* tail call recursive! Each recursive call must return its list, which is then "cons"ed onto (* i result-so-far).

The lazy-seq function allows Clojure to conserve stack, even though the algorithm is not tail call recursive. It is also able to reclaim unneeded early list items (I'm only interested in the last one), so it conserves heap.

Did Clojure figure out a way to not have to maintain state across all 59999 invocations? Or is it just doing it with fact3-lazy in a more-space-efficient way than fact3? I'm guessing that it is maintaining state on the heap instead of the stack, but I'll be honest, I'm not exactly sure what's going on.

Monday, September 22, 2014

The Meditative Coder


I changed the name of my blog from "Sford's Blog" to "The Meditative Coder".

A bit pretentious or pompous? I hope not. I do think about code a lot.

I'm not talking about design. Lots of people could claim to be a meditative designer, striving to make their structure and algorithm are elegant, efficient, and robust. I cut my professional teeth in a culture where coding was considered a low art. Design was where the prestige lay. And architecture? That gets you a corner office and groupies. I'm not kidding! I've seen some pretty mediocre programmers put on a pedestal when they achieved architect status.

I don't want to minimize the importance of a good architectural framework and a good design. However, I think coding sometimes gets short shrift. I think that well laid-out  and organized code is beautiful. Given the choice, I would rather maintain a program whose design is so-so and whose code is excellent, than vice versa.

I will admit to spending significant time after I've gotten a program working, just to make the code more readable -- changing variable and function names to be more-precisely descriptive, adding or subtracting whitespace to make the code clauses more logical, even re-factoring the code a bit to have cleaner, more-intuative abstractions, and even adding comments (judiciously).

So yeah, I think I'm a pretty meditative coder, perhaps more-so than average.

Finally, while "Sford's Blog" appeals to my minimalist aesthetic, it was both correct and useless. Besides, "meditative coder" got absolutely NO hits on Google; how can I resist? :-)

Friday, September 12, 2014

Clisp Performance

In my previous post, I did some speed measurements of Clojure.  John commented with a test in common lisp, which performed MUCH better than Clojure.

So I did some more experimentation.  Based on John's "fact2" results, his laptop appears to be about 3x faster than mine; not surprising since I have a 2nd gen Mac Air, and he probably has a real computer :-).  But when I ran John's fact4 with clisp, I got:

(time (fact4 1000))
Real time: 0.003765 sec.
Run time: 0.003765 sec.

Whoa!  That's over 200 times slower!  I tried it again.  And again.  And again.  Same results.

Then I tried (fact4 10000) and got "APPLY: too many arguments given to +".  Oh yeah, now I remember that about clisp.  :-)

Then I tried John's trick of timing a loop of them:

(time (dotimes (i 1000 (fact4 1000))))
Real time: 0.005146 sec.
Run time: 0.005117 sec.

Each run of (fact4 1000) now took 5 microseconds. I could imagine some overhead which needs to be amortized across multiple loopings, but that seems extreme.

Look at this:

(progn
  (time (dotimes (i 1 (fact4 1000))))
  (time (dotimes (i 10 (fact4 1000))))
  (time (dotimes (i 100 (fact4 1000))))
  (time (dotimes (i 1000 (fact4 1000)))))
Real time: 0.003783 sec.
Run time: 0.003783 sec.
Space: 8760 Bytes
Real time: 0.002432 sec.
Run time: 0.002394 sec.
Space: 8760 Bytes
Real time: 0.002528 sec.
Run time: 0.002528 sec.
Space: 8760 Bytes
Real time: 0.003901 sec.
Run time: 0.003832 sec.
Space: 8760 Bytes

They ALL took approximately the same time!  Is clisp somehow not doing the loop?  Maybe it sees that there are no side-effects, and just optimizes out all but one run?  Maybe it implicitly does Clojure's "memoize" function of caching results?  I kept going:

(progn
  (time (dotimes (i 1000 (fact4 1000))))
  (time (dotimes (i 10000 (fact4 1000))))
  (time (dotimes (i 100000 (fact4 1000))))
  (time (dotimes (i 1000000 (fact4 1000)))))
Real time: 0.004611 sec.
Run time: 0.00461 sec.
Space: 8760 Bytes
Real time: 0.017613 sec.
Run time: 0.01703 sec.
Space: 8760 Bytes
Real time: 0.126134 sec.
Run time: 0.125795 sec.
Space: 8760 Bytes
Real time: 1.231531 sec.
Run time: 1.230281 sec.
Space: 8760 Bytes

FINALLY!  The times start increasing by factors of 10 after about 10,000 runs, settling down to about 1.2 microseconds per "(fact4 1000)".  John - I suspect that you would see even better performance if you did more loops.

Anyway, I went back to the Clojure programs and did the same thing with timing multiple loopings.

(do
  (time (dotimes [i 1] (fact1 1000)))
  (time (dotimes [i 10] (fact1 1000)))
  (time (dotimes [i 100] (fact1 1000)))
  (time (dotimes [i 1000] (fact1 1000))))
"Elapsed time: 0.396 msecs"
"Elapsed time: 2.413 msecs"
"Elapsed time: 39.325 msecs"
"Elapsed time: 250.707 msecs"

The timings are basically consistent with a 10-fold increase in time with each 10-fold increase in loops.  Yes, there is variability (presumably due to GC), but multiple runs have tended to prove the rule.

I get the same trends with fact2 and fact3.  So my originally-posted timings were about right.

Conclusion?  I've spent WAY too much time on this issue this morning and I'm going to be late to work.  I think I'll just skip my morning exercise and only be a little late.  :-)

(Update: my original C-language timings were wrong.  I've corrected this post based on correct C timings.  I no longer mistrust the "clisp" results, and have removed claims to that effect.)

Update 2: now that I trust the timing, I did a bit more number crunching. I'll assume that clisp can run a single call to (fact4 1000) in 1.2 microseconds. There is obviously some kind of overhead associated with the first call. (Maybe clisp is compiling "fact4" just in time?  I.e. after "time" took the start timestamp?) Let's say that it takes 3500 microseconds for that overhead. So, with varying number of loops, I would expect to see the following:

3500 + 1.2 * 1 = 3501.2
3500 + 1.2 * 10 = 3512
3500 + 1.2 * 100 = 3620
3500 + 1.2 * 1000 = 4700
3500 + 1.2 * 10000 = 15500
3500 + 1.2 * 100000 = 123500
3500 + 1.2 * 1000000 = 1203500

Given the wide variability of individual timings, this matches reasonably well the results above. I don't think clisp is doing anything too fishy.

Thursday, September 11, 2014

Clojure performance

There is a cute site www.4clojure.com which poses coding challenges to Clojure programmers.  I don't have  lot of time each day, so I'm only up to #42, but I'm having fun.  The actual problems have been trivial so far, targeted to giving a beginner practice with Clojure features.  One nice thing about it is that you can "follow" somebody else to see what their solutions are (but you can only see them *after* you've solved it yourself).  I've found a pretty experienced Clojure programmer to follow, so it's been very educational to compare my sometimes-brute force solutions with his more elegant solutions.

For example, problem #42 is to write a factorial function.  For the purposes of this post, I'm changing the factorial algorithm to do addition rather than multiplication.  This is because I wanted to test it with large numbers without arithmetic overflow to get timing information.

As a long-time C programmer, I would just write:

long long fact (long long n)
{
  long long r = 1;
  long long i;
  for (i = 2; i <= n; i++)
    r += i;    /* not really factorial */
  return r;
}

I timed that (on my 2nd gen Mac Air) for n=1,000 and it took 6 nanoseconds.

A beginning Clojure functional programmer might do it recursively like this:

(defn fact1 [n]
  (if (= n 1)
      1
      (+ n (fact1 (dec n)))))

For n=1,000, this takes about 330 microseconds - about 55,000 times longer than C.  :-(

The more-experienced guy solved it this way:

(defn fact2 [n]
  (apply + (range 1 (inc n))))

This is very elegant and uses some of Clojure's interesting features.  As soon as I saw it, I thought, "wow, that probably took him 10 seconds to write."  It executes in almost exactly the same time as the recursive version - about 330 micros.

Here is how I wrote it:

(defn fact3 [n]
  (loop [sum-so-far 1, i n]
    (if (<= i 1)
      sum-so-far
      (recur (+ sum-so-far i) (dec i)))))

Note that this is arguably the most-complex of all the solutions.  Part of the reason I wanted to write it this way was that I wanted to practice a method called "tail recursion".  In this method, the algorithm is technically recursive (hence the "recur" function), but the compiler can trivially convert it into a simple loop.  I.e. this method does not actually create 1000 stack frames.  The reason for doing this is to make it more efficient.

My version takes about 200 microseconds, only 33,000 times longer than C.  :-)

What happens if we go with n=10,000?  Surprise!  The purely-recursive version gets a stack overflow.  (That seems like an awfully small stack to me...)  The other versions each take about 10 times longer, as you would expected.

The moral of the story?  Use Clojure when you want short development cycles, not high performance.  And by that measure, the *real* winner is that more-experienced Clojure programmer who wrote "(apply + (range 1 (inc n)))".  No "if" statements to get wrong, no partial-sum to keep track of, no stacks to overflow.  By a HUGE margin, the fastest to get to a working program.

(Update: my first posting used mistakenly used C timing without optimizing.  I've corrected this article with timings using "-O3" optimization.  The optimizer gave a 600x speed improvement, from 4 microseconds to 6 nanoseconds.  Wow!)

Monday, September 1, 2014

Clojure tools: bumps on the road

I've been enjoying getting my tools sharpened and ready for use.  John Jacobsen describes a nice set of workflow habits and tools that he uses in his blog.  It includes Emacs, Leiningen, and Cider.

27-Sep-2014: I've made in-place updates to this post.


lein print and println: no longer broken

Using the Clojure REPL in Emacs via Cider, it mostly worked, but the "print" and "println" functions were broken at one point:

    user> (print 1)
    1NoSuchMethodError clojure.tools.nrepl.StdOutBuffer.length()I ...

Good old Google.  A search for the above reveals a Leiningen issue which indicates the "stable" version 2.4.3 has a problem.  When this post was first written, the fix was not yet available. Fortunately, there was a workaround: downgrade to version 2.4.2:

    lein upgrade 2.4.2

This is no longer necessary: lein 2.5 is fixed:

    user> (print 1)
    1
    nil


Emacs packages: chicken-and-egg problem?

To get a productive Emacs-based Clojure environment, you need Emacs packages.  I had some initial trouble with it not finding the Marmalade packages I asked for.  I read some advice telling me to refresh the packages, but it wasn't exactly clear how to do it all in one go.  To complicate matters, I found myself trying out the approaches of several bloggers, leading me to need to clear my config and start over more than once.

(I tried "prelude" Emacs package, but it was rather wonky and VERY slow to start up, so gave up and went with a bunch of stuff in John's config.)

So I created "refresh_pkgs.el" containing the following:

    (require 'package)
    (add-to-list 'package-archives
             '("marmalade" . "http://marmalade-repo.org/packages/"))
    (package-initialize)
    (package-refresh-contents)

When I want to clear everything and start over, I do this:

    mv .emacs.d .emacs.d.sav
    mv .emacs .emacs.sav
    emacs -nw -q -l emacs_init/refresh_pkgs.el

When the welcome screen is displayed, exit Emacs with C-x C-c.  A new ".emacs.d" directory now exists with a refreshed Marmalade.  Now create the desired "init.el" (in the ".emacs.d" dir) and enter Emacs again.  This invocation will take a lot longer because it downloads and builds the desired packages.  Once again, when the welcome screen is finally displayed, exit Emacs.  You should be good to go.  (Note: I suspect that the "emacs -nw -q -l emacs_init/refresh_pkgs.el" command will be useful without starting over if there are new packages I want to try out.)

Here's the package loading section of my "init.el":

    (require 'package)
    (add-to-list 'package-archives
                 '("marmalade" . "http://marmalade-repo.org/packages/"))
    (package-initialize)
    (defvar my-packages '(better-defaults
                          auto-complete
                          undo-tree
                          paredit
                          auto-indent-mode
                          projectile
                          clojure-mode
                          cider))
    (dolist (p my-packages)
      (when (not (package-installed-p p))
        (package-install p)))

I won't show the rest because I'm sure I'll be making a lot of changes over the weeks to come.


emacs: melpa v.s. marmalade

Ugh.  For various reasons, I combined marmalade with melpa.  The net result was a nightmare of things not working quite right.  Apparently when both are enabled, the above package list will fetch incompatible versions of different packages.  I tried going with melpa-only, and several packages were not available.  I could have messed with it for longer, but I just wanted things to work again.  So I started from scratch again with the above settings.  With the latest lein, I'm back to everything working.

Saturday, August 23, 2014

Now starting on Clojure

I think I've gotten enough Lisp under my belt for now.  Not that I'm proficient at Lisp programming, but I think I understand enough to take my next step: Clojure.  I'm not very far along, but I have to say that I think I like it.

Clojure (pronounced "closure") is a young language (2007) with the following awesome properties:
  1. It is a functional programming language, though not as purist as Haskell.
  2. It is a variant of Lisp.
  3. It is build on top of the Java Virtual Machine, meaning that a Clojure program can access the entire Java standard library.
  4. It has built-in features to support concurrency and parallelism.
I've finished the first chapter of Emerick, Carper, and Grand's "Clojure Programming" book, and I've got to say that the information density so far is very high.  If it continues like this, it will be slow going.  I guess I like the book so far, but it's nowhere near as fun as "Land of Lisp".  I could gripe a bit about the index being very incomplete, but I got both the paper and the PDF versions of the book, so I can simply search the PDF.

Anyway, as with Lisp, I'm nowhere near at the level where I can talk intelligently about Clojure, so I won't try.

Thursday, August 14, 2014

Learning Lisp

For reasons I don't completely understand, I sometimes forgot an important fact about myself: I love learning.  Oh sure, we all learn little things all the time, satisfying our fleeting curiosities with wikipedia.  But I also like to sit down to a full course meal of learning.

So I recently decided to learn lisp.

Most professional programmers will ask me: why lisp?  It is barely used outside of academia, and isn't even used within academia that much.  It won't get me a job.  It's not very practical.

Fortunately, at my age, I don't have to care about the practicality of my interests.  :-)

Part of my desire to learn lisp comes from guilt associated with this blog.  Lisp is one of the pioneering computer languages.  Writing about programming without being familiar with lisp feels a bit like writing about relativity without being familiar with Newtonian physics.  Or writing about race relations without being familiar with slavery.  How can I claim to be a well-rounded developer with only knowledge of one of the pillars of the field?

In an earlier post, I mentioned finally being able to "get" lisp programming after spending 30 minutes with an on-line tutorial.  But I wanted more.  So I bought Land of Lisp / Learn to Program in Lisp, One Game at a Time!, by Conrad Barski (cheaper on Amazon, but I wanted the DRM-free PDF).  I'm about half-way through it, and am having a ball.  Partly for the obvious reason that I really love learning things.  But also because this is a good book.

Not perfect -- I suspect that neither the editors nor the early reviewers were people actually learning lisp.  Since I am actively trying to become proficient in lisp, I make sure I fully understand every line of code presented.  There are some programs where something is presented, but not explained.  For example, a type of "vector" is used in one of the programs, but is not mentioned in the text.  It required external research to discover that a lisp vector is just a simple (one-dimensional) array.

I'm afraid I also have to complain a bit about the coding style.  Meaningful variable names are often not used.  Or a term is used to mean subtly different things at different times (like "edges").  The code could benefit from more-careful naming.

However, in spite of that, I absolutely LOVE the premise - teach lisp with real-life programs.  I hate examples which just demonstrate syntax and semantics (I'm not sure I can stomach another "towers of hanoi" example).  Land of Lisp successfully helps to show *why* you would want to use various constructs, using programs that actually accomplish something meaningful.  And I really enjoy the style of Dr. Barski's prose.  His sense of humor, and his decision to use games as the basis for his programs, makes the book a joy to read.

I'm not proficient enough to say anything very intelligent about lisp as a language, so I won't just yet. However, I am learning some new ways to look at algorithms.  Having multiple points of view will always help you with your designs.  It probably won't be worth the effort in any practical or economic sense, but at least it's better than watching NCIS marathons on cable!

Tuesday, July 22, 2014

Adventure! ... in BASIC???

Over the weekend, I was feeling sorry for myself for not having as much time to program as I would like, so I sneaked a few hours (about 6) to write an adventure game.  I thought a bit about what language to use.  Part of me wanted to do it in one of the functional languages, but I knew I wouldn't have enough time to both learn the language sufficiently AND write enough to be recognizably adventurous.

And then a wave of nostalgia swept over me.

BASIC.

And none of this new-fangled BASIC with block-structured coding constructs.  Basic BASIC.  With 1 or 2 character variable names, and lots of GOTOs.

I gotta say, I haven't had this much fun programming in years.  I actually chuckled at times, remembering the old days of:

3165 if w <= 9 then 3170
3167   print "Too many words!  Talk more simply."
3168   v$ = "" :  n$ = "" : return
3170 rem

OK, the pedantic among you will say that putting multiple statements on a line is not basic BASIC.  Oh well, I make the rules, I can break them.  But I did restrict myself to variable names of one letter, or one letter and one digit, with all variables gloriously global.  Man, kids today don't know how good they have it.

>load "a.txt"
>run
You are in a brightly-lit room.  Birds are singing, puppies and kitties are playing in the sunlight streaming in through cheerily open windows.  There is a hole in the north wall that looks just big enough to crawl through.
You see:
  a grey bag of scratchy cloth
? north
You are in a dimly-lit room.  Shadowey figures seem to slither towards you.  A low moan can faintly be heard.  What little light penetrates the ominous darkness comes from the south.
You see:
  a fabulous 100-carat diamond!
? take diamond
got it.
You are in dim room
? n
Can't go that way
You are in dim room
? s
You are in a brightly-lit room.  Birds are singing, puppies and kitties are playing in the sunlight streaming in through cheerily open windows.  There is a hole in the north wall that looks just big enough to crawl through.
You see:
  a grey bag of scratchy cloth
? look at the grey bag
a grey bag of scratchy cloth
which contains:
  an iron key, slightly rusted
You are in bright room
? take bag
 NEXT without FOR  in line 9075
>

Whoopsie!  Looks like I need a bit more debugging.  :-)

Or not.  The exercise accomplished its purpose, which honestly was more to feel young again than hone my coding skills.  I.e. I don't think I'll be putting this on my resume any time soon.  ;-)

Update: I thought I might put it up on github, not because it is a useful or notable piece of work, but because I wanted to be the first person to put BASIC on github.  No such luck.  There is a ton of BASIC already on github.  Oh well...

Tango.net / Tango.me SMS Text Spam

A colleague of mine downloaded the Tango app, which went into her contact list and spammed text messages.  The text consisted of, "Check out my photo on Tango" followed by a URL which was obviously not specific to a particular photo, or even a particular user.  I went ahead and followed the link and it was just an advertisement for the Tango app.

My colleague was mortified by the fact that she unknowingly spammed her contact list, especially since not all phone users are on unlimited texting plans, and assured me that she gave no authorization.  I suspect that she actually did authorize it, but under sufficiently-ambiguous wording that it wasn't obvious -- probably something like, "Do you wish to share your photos with your friends?"

Looking at Tango's terms of use, I see:

11. Third Party Fees.For particular Devices, Tango may ask your permission to use your native SMS application to deliver messages or invitations to people who are not registered users of the Services and with whom you choose to communicate. Some of these services may charge additional fees.

Yeah, I bet.

I am not happy.  I've never liked spam, and text message spam is especially bad since it is more disruptive and costly than email spam.

Please don't use Tango.net or Tango.me until they stop this deplorable practice.

Tuesday, June 24, 2014

My work email inbox is empty

Wow.  I think the last time my work in-box was empty was when I joined 29West, back in 2005.  It stayed empty for probably an hour.

I dunno ... it's actually a little scary looking.

Monday, June 16, 2014

Maglus Stylus

I am very happy with the Maglus stylus for my iPad.  I wrote an Amazon review of why I like it so much (there are specific reasons).

The reason for this post is that you can get a 10% discount if you use this link.  It's part of an "affiliate" program that also kicks me a bit of money, which I have re-directed to the American Red Cross.

That said, I did notice that on Amazon, I could find it for even less than the 10% off list, so if you're looking to save money, shop around.  I would rather see you save as much money as you can, and independently donate a decent amount to the Red Cross because you want to support their good work.

Anyway, I never feel 100% comfortable promoting a commercial product lest I come across as a corporate shill.  Then I saw a post on their corporate blog on James Joyce and Ulysses.  Somehow that makes me feel less guilty being a booster.  Which is really silly since I've never read Ulysses!

Hmm ... what tags should I set on this post?  I don't see one for "shameless commerce".  I guess "tips" since I really do like the stylus a lot.

Thursday, June 12, 2014

Order in Structure Assignment Statements

I had what I thought was a pretty simple piece of code, transferring a structure from one thread to another.  Without describing it blow-by-blow, I'll present it thus: a writer thread is measuring temperature and pressure of a steam turbine.  A reader periodically samples the measurement.  There is no need for the reader to get every measurement, it only needs the most-recent at the point that it samples.


DESIGN!

A global data structure:

struct measurement_s {
    volatile uint64_t post_seq;  /* sequence number of measurement */
    volatile  int64_t temperature;
    volatile  int64_t pressure;
    volatile uint64_t pre_seq;   /* sequence number of measurement */
} live;

The algorithm is for the writer to first increment pre_seq, then update the temperature and pressure, then increment the post_seq.  The reader reads the post_seq first, then temperature and pressure, then pre_seq.  If the reader sees pre_seq != post_seq, it means that a collision has happened, and the reader should re-try.  Here's the reader code:

        struct measurement_s sample;
        do {
            sample = live;
        } while (sample.pre_seq != sample.post_seq);
        /* now have valid sample */


BROKEN!

Pretty simple.  Easy to understand.  And it didn't work right.  The test code was written to make it obvious when inconsistent values are read, and the code above generated inconsistent reads.  A LOT of inconsistent reads.  How can this be???  I used volatile to prevent the optimizer from messing with the order of reads and writes, and it's an x86-64, which has a very predictable memory model, removing the need for most hardware memory barriers.  The algorithm is correct!  Why is it failing?


DIAGNOSE!

So I did what I've been doing a surprising amount recently: I generated assem language (this time withOUT optimization turned on).  Here's the structure assignment:

    leaq    -64(%rbp), %rax
    movq    _live+24(%rip), %rcx
    movq    %rcx, 24(%rax)
    movq    _live+16(%rip), %rcx
    movq    %rcx, 16(%rax)
    movq    _live+8(%rip), %rcx
    movq    %rcx, 8(%rax)
    movq    _live(%rip), %rcx
    movq    %rcx, (%rax)

HEY!  It copied the structure in the REVERSE ORDER than I expected!  So it copies the pre_seq first, then the temperature and pressure, and post_seq last!  No wonder it detected collisions, that order is NOT thread-safe!


HAPPILY EVER AFTER!

So I replaced the structure assign with explicit copies of the fields:

        struct measurement_s sample;
        do {
            /* Don't do "sample = live;", order of field copies is not defined. */
            sample.post_seq = live.post_seq;
            sample.temperature = live.temperature;
            sample.pressure = live.pressure;
            sample.pre_seq = live.pre_seq;
        } while (sample.pre_seq != sample.post_seq);

YAY!  It works.  (Whew!)

It would also be possible to define a sub-structure containing temperature, pressure, and any additional interesting data.  The fields in the sub-structure could be copied as struct assign, allowing the compiler to do it in any order.  But the pre_seq and post_seq must be individually copied in the right order.


DIGRESSIONS!

I do love those exclamation points!

So, the above code works.  How *good* is the design?  For physical process measurement, probably just fine.  It would be rare to measure physical characteristics more often than a few times per second, so the chances of the reader having to re-try the sample are very low.  (On the other hand, why not use mutexes if you are going that slow?)

But suppose this algorithm is applied to market data, which might have many thousands of ticks per second?  The algorithm shown suffers from a fundamental flaw: starvation.  If a writer and reader get pathologically synchronized, there is no bound to the number of times the reader will cycle, trying to get a clean sample.

In practice, it might be fine.  But whenever possible, I prefer a solution that works by design, not by circumstance.

I think a better approach might be to use a non-blocking queue, like this one.  The writer presumably requires non-zero work to produce a new record.  The consumer should read the queue dry in a tight loop before processing the last record read.  Now the queue need only be long enough to hold the maximum number of produced records between two reads.  A slow reader will simply clean out the queue each time it reads.


Some might question the wisdom of doing ANY kind of programming where the order of field copies in a struct assign is important.  In the old days, this program would have been done with a critical section, protected by a mutex or semaphore.  That works fine, and doesn't depend on order.  But high-performance applications of today need much lower latencies.  So lock-free algorithms are important.  With lock-free, you need to concern yourself with order; the compiler can change the order of the assem language operations from the source code, and the hardware can change the order of physical memory operations from the assem code.  Compile optimizer barriers address the former; hardware memory barriers address the latter.


Finally, some will question my choice of making all the fields of the structure volatile.  I admit it is a heavy-handed approach that I used for expediency -- this was a prototype after all.  For production code I would use a more-precisely targeted VOL macro (mentioned in an earlier post), often in cooperation with a general compile optimizer barrier, like "asm volatile ("" ::: "memory");".  Note that these techniques are NOT used to squeeze every nanosecond out of the code.  They are used to insure code correctness.

Tuesday, June 10, 2014

Error handling: the enemy of readability?

I've been a programmer since the mid-1970s, call it 40 years.  During that time I've seen the rise into popularity: structured programming and object-oriented programming.  (More-recently, I've seen increased interest in functional programming, and I've started learning a bit about it myself, but I'm not conversant-enough in it to write intelligently about it.  So I'll omit it from this discussion.)

After all these years, I find it very interesting that I haven't seen a common approach to internal error handling become widely-accepted, the way structure and object-orientation have.

I'm speaking here of internal errors, not user errors.  If the user enters bad input, you owe it to the user to detect the problem and report it in a user-friendly way, allowing the user to correct his mistake.  User error detection and handling code is fully part of the program.

Internal errors are different.  These are found by inserting sanity checks to see if something that "should never happen" might have actually happened.  Very often, failed sanity checks represent a software bug that needs to be tracked down and fixed.


ROBUST CODE vs. READABLE CODE?

Most approaches to internal error handling I've seen are poor.  Maybe it is partly due to a lack of programmer discipline - we all get lazy sometimes and forget to check a return status - but I don't think that is the primary reason.  I believe the main culprit is the need for readable code.

The greatest expense in any significant programming effort is not the original writing of code, it's the subsequent maintenance.  The original programmers eventually leave the project, and less-experienced programmers often take over the less-glamorous job of fixing bugs and adding enhancements.  It is critically important that the code be readable.  External documentation is typically out of date moments after it is written, and even internal documentation (comments) are suspect.  The code itself is the only reality that matters, and the harder it is to understand that code, the harder (and more expensive) it will be to maintain and improve it.

And unfortunately, doing a thorough job of checking and handling internal errors often makes the code harder to read.

C

Which is easier to read and maintain?  This:

example 1:
  foo_t *foo_create()
  {
      foo_t *new_foo = malloc(sizeof(foo_t));

      new_foo->state_filep = fopen("foo.state", "rw");
      new_foo->data_filep = fopen("foo.data", "rw");

      new_foo->reader = reader_create();
      new_foo->writer = writer_create();

      return new_foo;
  }  /* foo_create */

or this:

example 2:
  foo_t *foo_create()
  {
      foo_t *new_foo = malloc(sizeof(foo_t));
      if (new_foo != NULL) {  /* malloc successful? */

          new_foo->state_filep = fopen("foo.state", "rw");
          if (new_foo->state_filep != NULL) {  /* fopen successful? */
              new_foo->data_filep = fopen("foo.data", "rw");
              if (new_foo->data_filep != NULL) {  /* fopen successful? */

                  New_foo->reader = reader_create();
                  if (new_foo->reader != NULL) {  /* reader successful? */
                      new_foo->writer = writer_create();
                      if (new_foo->writer != NULL) {  /* writer successful? */

                          return new_foo;
                      }
                      else {
                          log_error("Could not create writer");
                          reader_delete(new_foo->reader);
                          fclose(new_foo->data_filep);
                          fclose(new_foo->state_filep);
                          free(new_foo);
                          return NULL;
                      }
                  }
                  else {
                      log_error("Could not create reader");
                      fclose(new_foo->data_filep);
                      fclose(new_foo->state_filep);
                      free(new_foo);
                      return NULL;
                  }
              }
              else {
                  log_error("Could not open data file");
                  fclose(new_foo->state_filep);
                  free(new_foo);
                  return NULL;
              }
          }
          else {
              log_error("Could not open state file");
              free(new_foo);
              return NULL;
          }
      }
      else {
          log_error("Could not malloc foo");
          return NULL;
      }
  }  /* foo_create */

Most projects I've worked on use a hybrid of the two approaches, but most lean heavily towards one or the other.  The first is easy to read, but brittle; if anything goes wrong, the problem often doesn't generate any visible symptoms till much later, making it a nightmare to diagnose.  The second detects errors early and does proper reporting and cleanup, but is a nightmare to read and maintain.  (All that duplicated code in the else clauses is sure to get out of sync at some point, resulting in resource leaks or uninitialized accesses, neither of which will be detected unless some error actually happens and the error handling code finally gets exercised.)

Wouldn't it be nice to have both good error detection and readability?

example 3:
  foo_t *foo_create()
  {
      foo_t *new_foo = NULL;
      new_foo = malloc(sizeof(foo_t));  ASSRT(new_foo!=NULL);
      memset(new_foo, 0, sizeof(new_foo));

      new_foo->state_filep = fopen("foo.state", "rw");  ASSRT(new_foo->state_filep!=NULL);
      new_foo->data_filep = fopen("foo.data", "rw");  ASSRT(new_foo->data_filep!=NULL);

      new_foo->reader = reader_create();  ASSRT(new_foo->reader!=NULL);
      new_foo->writer = writer_create();  ASSRT(new_foo->writer!=NULL);

      return new_foo;

  ASSRT_FAIL:
      if (new_foo != NULL) {
          if (new_foo->writer != NULL) reader_delete(new_foo0->writer);
          if (new_foo->reader != NULL) reader_delete(new_foo0->reader);
          if (new_foo->data_filep != NULL) fclose(new_foo0->data_filep);
          if (new_foo->state_filep != NULL) fclose(new_foo0->state_filep);
          memset(new_foo, 0, sizeof(new_foo));  /* poison the freed structure */
          asm volatile ("" ::: "memory");       /* See note [1] */
          free(new_foo);
      }

      return NULL;
  }  /* foo_create */

Note [1]: See http://blog.geeky-boy.com/2014/06/clangllvm-optimize-o3-understands-free.html

My philosophy is to catch errors as early as possible by including sanity checks, but doing it unobtrusively.  Make it easy to ignore if you're looking for the code's basic algorithm, but easy to understand if you are analyzing the error handling.  I think example 3 to be approximately as readable as the first example, and as robust as the second.  It is important to keep the error handling code visually small so that it can be appended to the ends of lines.  This:

      foo->state_filep = fopen("foo.state", "rw");  ASSRT(foo->state_filep!=NULL);

is much cleaner than this:

      foo->state_filep = fopen("foo.state", "rw");
      if (foo->state_filep) {
          blah blah blah
      }

As you've probably guessed, the ASSRT macro does nothing if the expression is true, or does a "goto ASSRT_FAIL;" if the expression is false.  A variation is to use "abort()" instead of goto.  In many ways this is FAR superior for a troubleshooter; a core file can provide a full stack trace and access to program state.  Calling "abort()" also simplifies the code further, removing explicit clean up.  But it also means that the code has given up on any hope of recovery.  Differentiating between an un-recoverable error and a recoverable error is often not possible at the point where the error is detected; it is usually considered good form to report the error and return a bad status.

(FYI - the "memset()" before the "free()" is commented as "poison the freed structure".  The purpose of this is to make sure that any dangling references to the foo object will seg fault the instant they try to de-reference one of the internal pointers.  Again, my philosophy is to find bugs as soon as possible.)


ASSRT()

What is this magical "ASSRT()" macro?

  #define ASSRT(cond_expr) do { \
      if (!(cond_expr)) { \
          char errmsg[256]; \
          snprintf(errmsg, sizeof(errmsg), "%s:%d, failed assert: (%s)", \
              __FILE__, __LINE__, #cond_expr); \
          log_error(errmsg); \
          goto ASSRT_FAIL; \
      }  \
  } while (0)

For those unfamiliar with C macro magic, there are two bits here that might benefit from explanation.  First the "do ... while (0)".  This is explained pretty well here.  It's the standard way to avoid tricky bugs when macros are used as part of if statements.  The second is the "#cond_expr" at the end of the sprintf.  The C preprocessor converts "#macro input parameter" into a C string.

So, the macro expansion of "new_foo = malloc(sizeof(foo_t));  ASSRT(new_foo!=NULL);" is:

      new_foo = malloc(sizeof(foo_t));  do {
          if (!(new_foo!=NULL)) {
              char errmsg[256];
              snprintf(errmsg, sizeof(errmsg), "%s:%d, condition false: '%s'",
                  "foo.c", 50, "new_foo!=NULL");
              log_error(errmsg);
              goto ASSRT_FAIL;
          }
      } while (0);

An important reason for making this a macro instead of a function is so that the __FILE__ and __LINE__ compiler built-ins apply to the source line which uses ASSRT.  I've also sometimes enhanced the ASSRT macro, making it specific to a program or even a function by including prints of interesting state information.  I've also sometimes included errno.  These improvements help the troubleshooter diagnose the internal problem.

One big drawback to this approach?  The error messages logged can only be loved by a programmer. What would an end user do with this?

  Error logger: foo.c:50, failed assert: (new_foo!=NULL)

This is wonderful for a programmer; I don't know how many times I've seen an error log containing, "could not open file", only to find that the string appears many times in the source code.  This pin-points the source file and line.  However, this message is virtually useless to an end user.

On the other hand, is it really any worse than this?

  Error logger: could not malloc foo

This looks less like geek-speak, but doesn't actually help an end user any more.  In both cases, the user has little choice but to complain to the developer.

My point is that including an error-specific, highly-descriptive message makes the error handling code longer and more-intrusive.  One goal of mine is to keep the ASSRT on the same line as the line it's checking; hard to do if including a descriptive message.  But as soon as it overflows onto subsequent lines, it interferes with the logical flow of the code, making it harder to read.  You figure, since internal errors are usually the result of a software bug, it's not so bad to log errors that only a programmer could love - only programmers need to deal with them, users only need to accurately report them.  (Which reminds me of a war story.)

I have an evolved version of this approach, which I plan to clean up and present in the not-too-distant future.


JAVA

Java adds exception handling, which some would say contradicts my claim that error handling isn't standardized.  However, Java exceptions merely standardize the low-level language tool.  It doesn't really address the approach that a program takes.  When used poorly (e.g. try-catch at every call), it reduces to the unreadability of example 2.  When done well, it approximates the good readability of example 3.  The key is the sacrifice some user-friendliness of error message in favor of keeping the program structure clear.  That sacrifice is worth it since internal errors should rarely happen.


C++

I am a bit embarrassed to admit that I'm not very familiar with C++'s exception handling facility.  I've never worked on a project that used it.  I've heard many programmers state their strongly-held opinion that C++ exceptions are to be avoided at all costs, but I don't have enough personal experience to have my own opinion.  One concrete data point: a computer engineering senior at U of I (a respected institution) tells me that their C++ course does not teach exceptions.

But I figure that even if C++ exceptions are not up to the task, my C method should work fine.  I'm not married to any specific language tool, so long as error handling can be added without obscuring the code.


P.S.

I risk being haunted by the ghost of Dijkstra by using a dreaded goto.  Maybe Knuth will protect me (alternate).  :-)


P.P.S.

I have a slightly different version in my wiki.  I change the goto to abort(), and I also print the human-readable form of errno.  While fine for most Unix tools, those changes will not be appropriate for all kinds of programs, so the form above is more general.

Saturday, June 7, 2014

Grace Hopper documentary (crowd funding)

I suffer from a little bit of hero worship for Grace Hopper (1906 - 1992).  She is often credited with writing the first compiler in 1952, and was an important contributor to the development of COBOL.

There is a crowd-funded effort to produce a documentary on her:

http://www.indiegogo.com/projects/born-with-curiosity-the-grace-hopper-documentary

I would like to see this documentary succeed because I really would like to know more about her than appears in Wikipedia.

Update: Grace Hopper on Letterman:
    https://www.youtube.com/watch?v=1-vcErOPofQ
She really was charming.

Thursday, June 5, 2014

clang/llvm optimize -O3 understands free()


Wow, I just found an amazing bug, although I'm not "bug" is the right word for it.

I have an object delete function:

void e_del(e_t *e)
{
    if (e == NULL || e->signature != E_SIGNATURE) abort();
    e->signature = 0;
    free(e);
}

I have int signature at the end of the e_t structure, set at object creation to a known value (E_SIGNATURE).  Then, I can check periodically to make sure that a valid object is being used.  This detects a variety of bugs, like writing past the end of a buffer in the e_t structure, or just an invalid pointer.  The setting of signature to zero in e_del() is to detect a double call to e_del() with the same object.

In my unit tests, I coded a double-delete to make sure it worked, but abort didn't get called!  Instead, it double-freed!

Silly me, I must have a stupid error in my logic.  Re-build without optimization to make it easier to set breakpoints and single step.  Works.  Works fine.  Without optimization, my code works.  With optimize level 3, it doesn't.  I generated assembly language for optimize and non-optimize cases, and guess what.  At optimize 3, the zero of signature is gone.  Completely gone.

I had a theory, which I tested by changing free(e) to printf("%p", e) and guess what.  At optimize 3, the zero of signature reappeared.  The code worked as expected.  Change printf back to free, and the zero of signature disappeared again.

Apparently, clang/llvm knows what free() is for.  It knows that only a buggy program would try to access a heap segment after it is freed.  So for any non-buggy (and non-multi-threaded) program, it is safe to throw away the "e->signature = 0" line.   My problem is that I DON'T want to assume a bug-free program.  I want to do a good job of detecting bugs.

How to fix?  A variation on Linux's ACCESS_ONCE macro which I named VOL:

#define VOL(_x, _type) (*(volatile _type *)&(_x))
. . .
VOL(e->signature, int) = 0;
free(e);

This "volatilizes" access to e->signature, forcing the optimizer to leave the operation alone.  (FYI - my reason for making my own macro is that ACCESS_ONCE uses the gcc-specific built-in "typeof()" to remove the need to pass in the data type.  But I didn't want my code to be gcc-specific.)

I've done a lot of multi-threaded programming in my day, most-recently experimenting with lock-free, non-blocking designs, so I've had to worry about the optimizer from time to time.  But this is the first time I've ever had a simple, single-threaded, straight-line code work wrong due to the optimizer.  Think about what this means.  The optimizer knows what free() is for, knows that the freed segment shouldn't be accessed after the free, and makes code generation decisions on that basis.  I wonder what else the optimizer thinks it knows...

UPDATE: in its first posting, mistakenly attributed this issue to gcc.  I forgot that my Mac now uses the clang/llvm compiler.  I tried this with true gcc, and it did NOT omit the clear of signature.

UPDATE2: I changed the assignment of zero to signature to a memset of the whole structure:

void e_del(e_t *e)
{
    if (e == NULL || e->signature != E_SIGNATURE) abort();
    memset(e, 0, sizeof(*e));
    free(e);
}

The compiler did NOT optimize out the memset.  I'm guessing that it thought the memset function might have side effects, which is interesting since memset is a compiler builtin.  In fact, if signature is the only field in e_t, the memset compiles to the exact same instructions as the zero assignment.

UPDATE3: Wow, optimizers are surprising.  I found a case where the memset() IS optimized away.  I tried telling the compiler that e points at volatile memory, and it STILL optimized it away.   I'm back to using VOL() on the individual fields.  Again, this is all clang/llvm on Mac.  I've never seen gcc optimize away code based on knowing what free() does.