Tuesday, September 23, 2014

Tail Call Recursion in Clojure

In this article, I explain tail call recursion, show an example in both Clojure and common lisp, opine that Clojure's approach to tail call recursion optimization is better than lisp's, and finally look at Clojure's "lazy-seq" which seems to mysteriously provide the benefits of tail call recursion optimization for algorithms which are *not* tail call recursive.

Note: in this article, I refer to "tail call recursion", but I am actually dealing with a subset of that domain: tail *self* call (a.k.a. "direct") recursion. The theories presented also apply to the general case, including mutual (a.k.a. "indirect") recursion. However, optimization of tail *mutual* call recursion is much more rare in compilers than optimization of tail *self* call recursion. So I'm only considering self call.

What Is Tail Call Recursion?

Tail call recursion is a special case of recursion. With recursion, a function calls itself. *Tail* call recursion is where the function calls itself, and then immediately returns, either returning nothing, or returning the value of the recursive call. I.e. it doesn't call itself, and then do something with the results recursive call's value before returning. The reason tail call recursion is an interesting special case is that it can be optimized in a way that doesn't require a stack frame for each recursive invocation (i.e. it converts it into a simple loop). This makes it faster and conserves stack space.

Take the most-obvious way to program factorial recursively in clojure:

(defn fact1 [x]
  (if (<= x 1)
    (* x (fact1 (dec x)))))

(fact1 5) ;= 120N
(fact1 9000) ;= StackOverflowError

This is not tail call recursive. If you call (fact 5), that initial invocation needs to wait for the recursive invocation to return the value of 4! so that it can multiply it by 5 and return the result. It needs to maintain the state of x across the recursive invocation, and it uses the stack for that. Clojure's stack seems awfully small to me; less than 9000 levels deep.

It's easy to modify fact1 to be tail recursive. It requires a modified version in which you pass a "result so far" as an additional input parameter:

(defn fact2-tail [x result-so-far]
  (if (<= x 1)
    (fact2-tail (dec x) (* x result-so-far))))

(fact2-tail 5 1N) ;= 120N
(fact2-tail 9000 1N) ;= StackOverflowError

(If you don't like having to pass the "1N", you can simply define a wrapper function.) This is tail call recursive because all the work, including the multiplication, is done before the recursive call.  So theoretically no state needs to be stacked up as the recursion goes deeper.  But as you can see by the stack overflow, Clojure doesn't realize this.

Be aware that some lisps will do automatic tail call optimization (or at least tail self-call optimization). I tried the equiv thing with clisp and it worked fine with big values:

 (defun fact2-tail-lisp (x result-so-far)
   (if (<= x 1)
     (fact2-tail-lisp (1- x) (* x result-so-far)))))

(fact2-tail-lisp 59999 1) ;= 260,630-digit number starting with 2606

So what's a Clojure programmer to do? Use the loop/recur construct. Let's convert fact1, which is NOT tail call recursive, to loop/recur:

(defn fact1-loop [x]
  (loop [i x]
    (if (<= i 1)
      (* i (recur (dec i))))))
CompilerException java.lang.UnsupportedOperationException: Can only recur from tail position

Interesting. Clojure is NOT smart enough to recognize tail call recursion in fact2-tail, but it is smart enough to recognize NON-tail call recursion in fact1_loop. So the Clojure programmer must both recode the algorithm to be tail call recursive and code it to use loop/recur:

(defn fact2-loop [x]
  (loop [i x, result-so-far 1N]
    (if (<= i 1)
      (recur (dec i) (* i result-so-far)))))

(fact2-loop 5) ;= 120N
(fact2-loop 59999) ;= 260,630-digit number starting with 2606

There! One advantage to using loop/recur is you don't need a separate function which takes the extra result-so-far parameter. It's kind of built in to the loop construct.

Which Is Better?

So, which is better: Clisp, which automatically did the optimization, or Clojure, where the programmer basically has to *tell* the compiler to optimize it with loop/recur?

I have to side with Clojure, and I'll tell you why. I started my experimentations knowing that lisp did the optimization automatically. I coded it up, and it overflowed the stack. Huh? I had to do a bunch of googling to discover this page which mentions in footnote 1 that you have to compile the code to get it to do the optimization. Had I not tried it with a large number, or if I had mis-coded the algorithm to not really be tail call recursive, then I would have put in some effort to achieve an intended benefit, would not have received that benefit, and I might not have known it.

Most of the time, you need to carefully code your algorithm to be tail call recursive. I prefer Clojure's approach where you tell the compiler that you intend it to perform the optimization, and it gives you an error if it can't, as opposed to lisp which silently leaves it un-optimized if you make a mistake.  It also gives a hint to a future maintainer that your potentially awkward algorithm is coded that way for a reason (admit it: fact2-tail and fact2-loop are more awkward algorithms than fact1).

What About lazy-seq?

Now I'm experimenting with lazy-seq, which can also be used to avoid stack consumption. It's a different animal -- the factorial function only returns a single value, whereas lazy-seq is intended to assemble a list (or more-accurately, a lazy sequence).

Let's start with re-writing fact1 to return a list of all factorials, from 1 to the requested value:

(defn fact3 [x i result-so-far]
  (if (> i x)
     (* i result-so-far)
     (fact3 x (inc i) (* i result-so-far)))))

(fact3 5 1N 1N) ;= (1N 2N 6N 120N)
(last (fact3 5 1N 1N)) ;= 120N
(last (fact3 9000 1N 1N)) ;= StackOverflowError

This function is recursive but doesn't use loop/recur. No wonder it blew up the stack. To convert this to loop/recur requires a change in the output.  The final return value has to be the entire list, so the "cons"ing needs to be done on-the-fly each time through the loop. Thus, larger factorial values are "cons"ed onto a list-so-far of smaller factorial values, resulting in a reversed list:

(defn fact3-loop [x]
  (loop [i 2N, result-so-far '(1N)]
    (if (> i x)
      (recur (inc i) (cons (* i (first result-so-far)) result-so-far)))))

(fact3-loop 5) ;= (120N 24N 6N 2N 1N)
(first (fact3-loop 5) ;= 120N
(first (fact3-loop 9000)) ;= 31,682-digit number starting with 8099
(first (fact3-loop 59999)) ;= OutOfMemoryError  Java heap space

We solved the stack problem, but ran out of heap assembling the massive list (note that most of the elements of the list hold VERY large numbers, which consumes significant space for each list element).

lazy-seq to the rescue! Here's it is in action:

(defn fact3-lazy [x i result-so-far]
  (if (> i x)
     (* i result-so-far)
     (lazy-seq (fact3-lazy x (inc i) (* i result-so-far))))))

(fact3-lazy 5 1N 1N) ;= (1N 2N 6N 24N 120N)
(last (fact3-lazy 5 1N 1N) ;= 120N
(last (fact3-lazy 9000 1N 1N) ;= 31,682-digit number starting with 8099
(last (fact3-lazy 59999 1N 1N) ;= 260,630-digit number starting with 2606

Look at the algorithm; it's almost identical to fact3. I.e. it is *not* tail call recursive! Each recursive call must return its list, which is then "cons"ed onto (* i result-so-far).

The lazy-seq function allows Clojure to conserve stack, even though the algorithm is not tail call recursive. It is also able to reclaim unneeded early list items (I'm only interested in the last one), so it conserves heap.

Did Clojure figure out a way to not have to maintain state across all 59999 invocations? Or is it just doing it with fact3-lazy in a more-space-efficient way than fact3? I'm guessing that it is maintaining state on the heap instead of the stack, but I'll be honest, I'm not exactly sure what's going on.

Monday, September 22, 2014

The Meditative Coder

I changed the name of my blog from "Sford's Blog" to "The Meditative Coder".

A bit pretentious or pompous? I hope not. I do think about code a lot.

I'm not talking about design. Lots of people could claim to be a meditative designer, striving to make their structure and algorithm are elegant, efficient, and robust. I cut my professional teeth in a culture where coding was considered a low art. Design was where the prestige lay. And architecture? That gets you a corner office and groupies. I'm not kidding! I've seen some pretty mediocre programmers put on a pedestal when they achieved architect status.

I don't want to minimize the importance of a good architectural framework and a good design. However, I think coding sometimes gets short shrift. I think that well laid-out  and organized code is beautiful. Given the choice, I would rather maintain a program whose design is so-so and whose code is excellent, than vice versa.

I will admit to spending significant time after I've gotten a program working, just to make the code more readable -- changing variable and function names to be more-precisely descriptive, adding or subtracting whitespace to make the code clauses more logical, even re-factoring the code a bit to have cleaner, more-intuative abstractions, and even adding comments (judiciously).

So yeah, I think I'm a pretty meditative coder, perhaps more-so than average.

Finally, while "Sford's Blog" appeals to my minimalist aesthetic, it was both correct and useless. Besides, "meditative coder" got absolutely NO hits on Google; how can I resist? :-)

Friday, September 12, 2014

Clisp Performance

In my previous post, I did some speed measurements of Clojure.  John commented with a test in common lisp, which performed MUCH better than Clojure.

So I did some more experimentation.  Based on John's "fact2" results, his laptop appears to be about 3x faster than mine; not surprising since I have a 2nd gen Mac Air, and he probably has a real computer :-).  But when I ran John's fact4 with clisp, I got:

(time (fact4 1000))
Real time: 0.003765 sec.
Run time: 0.003765 sec.

Whoa!  That's over 200 times slower!  I tried it again.  And again.  And again.  Same results.

Then I tried (fact4 10000) and got "APPLY: too many arguments given to +".  Oh yeah, now I remember that about clisp.  :-)

Then I tried John's trick of timing a loop of them:

(time (dotimes (i 1000 (fact4 1000))))
Real time: 0.005146 sec.
Run time: 0.005117 sec.

Each run of (fact4 1000) now took 5 microseconds. I could imagine some overhead which needs to be amortized across multiple loopings, but that seems extreme.

Look at this:

  (time (dotimes (i 1 (fact4 1000))))
  (time (dotimes (i 10 (fact4 1000))))
  (time (dotimes (i 100 (fact4 1000))))
  (time (dotimes (i 1000 (fact4 1000)))))
Real time: 0.003783 sec.
Run time: 0.003783 sec.
Space: 8760 Bytes
Real time: 0.002432 sec.
Run time: 0.002394 sec.
Space: 8760 Bytes
Real time: 0.002528 sec.
Run time: 0.002528 sec.
Space: 8760 Bytes
Real time: 0.003901 sec.
Run time: 0.003832 sec.
Space: 8760 Bytes

They ALL took approximately the same time!  Is clisp somehow not doing the loop?  Maybe it sees that there are no side-effects, and just optimizes out all but one run?  Maybe it implicitly does Clojure's "memoize" function of caching results?  I kept going:

  (time (dotimes (i 1000 (fact4 1000))))
  (time (dotimes (i 10000 (fact4 1000))))
  (time (dotimes (i 100000 (fact4 1000))))
  (time (dotimes (i 1000000 (fact4 1000)))))
Real time: 0.004611 sec.
Run time: 0.00461 sec.
Space: 8760 Bytes
Real time: 0.017613 sec.
Run time: 0.01703 sec.
Space: 8760 Bytes
Real time: 0.126134 sec.
Run time: 0.125795 sec.
Space: 8760 Bytes
Real time: 1.231531 sec.
Run time: 1.230281 sec.
Space: 8760 Bytes

FINALLY!  The times start increasing by factors of 10 after about 10,000 runs, settling down to about 1.2 microseconds per "(fact4 1000)".  John - I suspect that you would see even better performance if you did more loops.

Anyway, I went back to the Clojure programs and did the same thing with timing multiple loopings.

  (time (dotimes [i 1] (fact1 1000)))
  (time (dotimes [i 10] (fact1 1000)))
  (time (dotimes [i 100] (fact1 1000)))
  (time (dotimes [i 1000] (fact1 1000))))
"Elapsed time: 0.396 msecs"
"Elapsed time: 2.413 msecs"
"Elapsed time: 39.325 msecs"
"Elapsed time: 250.707 msecs"

The timings are basically consistent with a 10-fold increase in time with each 10-fold increase in loops.  Yes, there is variability (presumably due to GC), but multiple runs have tended to prove the rule.

I get the same trends with fact2 and fact3.  So my originally-posted timings were about right.

Conclusion?  I've spent WAY too much time on this issue this morning and I'm going to be late to work.  I think I'll just skip my morning exercise and only be a little late.  :-)

(Update: my original C-language timings were wrong.  I've corrected this post based on correct C timings.  I no longer mistrust the "clisp" results, and have removed claims to that effect.)

Update 2: now that I trust the timing, I did a bit more number crunching. I'll assume that clisp can run a single call to (fact4 1000) in 1.2 microseconds. There is obviously some kind of overhead associated with the first call. (Maybe clisp is compiling "fact4" just in time?  I.e. after "time" took the start timestamp?) Let's say that it takes 3500 microseconds for that overhead. So, with varying number of loops, I would expect to see the following:

3500 + 1.2 * 1 = 3501.2
3500 + 1.2 * 10 = 3512
3500 + 1.2 * 100 = 3620
3500 + 1.2 * 1000 = 4700
3500 + 1.2 * 10000 = 15500
3500 + 1.2 * 100000 = 123500
3500 + 1.2 * 1000000 = 1203500

Given the wide variability of individual timings, this matches reasonably well the results above. I don't think clisp is doing anything too fishy.

Thursday, September 11, 2014

Clojure performance

There is a cute site www.4clojure.com which poses coding challenges to Clojure programmers.  I don't have  lot of time each day, so I'm only up to #42, but I'm having fun.  The actual problems have been trivial so far, targeted to giving a beginner practice with Clojure features.  One nice thing about it is that you can "follow" somebody else to see what their solutions are (but you can only see them *after* you've solved it yourself).  I've found a pretty experienced Clojure programmer to follow, so it's been very educational to compare my sometimes-brute force solutions with his more elegant solutions.

For example, problem #42 is to write a factorial function.  For the purposes of this post, I'm changing the factorial algorithm to do addition rather than multiplication.  This is because I wanted to test it with large numbers without arithmetic overflow to get timing information.

As a long-time C programmer, I would just write:

long long fact (long long n)
  long long r = 1;
  long long i;
  for (i = 2; i <= n; i++)
    r += i;    /* not really factorial */
  return r;

I timed that (on my 2nd gen Mac Air) for n=1,000 and it took 6 nanoseconds.

A beginning Clojure functional programmer might do it recursively like this:

(defn fact1 [n]
  (if (= n 1)
      (+ n (fact1 (dec n)))))

For n=1,000, this takes about 330 microseconds - about 55,000 times longer than C.  :-(

The more-experienced guy solved it this way:

(defn fact2 [n]
  (apply + (range 1 (inc n))))

This is very elegant and uses some of Clojure's interesting features.  As soon as I saw it, I thought, "wow, that probably took him 10 seconds to write."  It executes in almost exactly the same time as the recursive version - about 330 micros.

Here is how I wrote it:

(defn fact3 [n]
  (loop [sum-so-far 1, i n]
    (if (<= i 1)
      (recur (+ sum-so-far i) (dec i)))))

Note that this is arguably the most-complex of all the solutions.  Part of the reason I wanted to write it this way was that I wanted to practice a method called "tail recursion".  In this method, the algorithm is technically recursive (hence the "recur" function), but the compiler can trivially convert it into a simple loop.  I.e. this method does not actually create 1000 stack frames.  The reason for doing this is to make it more efficient.

My version takes about 200 microseconds, only 33,000 times longer than C.  :-)

What happens if we go with n=10,000?  Surprise!  The purely-recursive version gets a stack overflow.  (That seems like an awfully small stack to me...)  The other versions each take about 10 times longer, as you would expected.

The moral of the story?  Use Clojure when you want short development cycles, not high performance.  And by that measure, the *real* winner is that more-experienced Clojure programmer who wrote "(apply + (range 1 (inc n)))".  No "if" statements to get wrong, no partial-sum to keep track of, no stacks to overflow.  By a HUGE margin, the fastest to get to a working program.

(Update: my first posting used mistakenly used C timing without optimizing.  I've corrected this article with timings using "-O3" optimization.  The optimizer gave a 600x speed improvement, from 4 microseconds to 6 nanoseconds.  Wow!)

Monday, September 1, 2014

Clojure tools: bumps on the road

I've been enjoying getting my tools sharpened and ready for use.  John Jacobsen describes a nice set of workflow habits and tools that he uses in his blog.  It includes Emacs, Leiningen, and Cider.

27-Sep-2014: I've made in-place updates to this post.

lein print and println: no longer broken

Using the Clojure REPL in Emacs via Cider, it mostly worked, but the "print" and "println" functions were broken at one point:

    user> (print 1)
    1NoSuchMethodError clojure.tools.nrepl.StdOutBuffer.length()I ...

Good old Google.  A search for the above reveals a Leiningen issue which indicates the "stable" version 2.4.3 has a problem.  When this post was first written, the fix was not yet available. Fortunately, there was a workaround: downgrade to version 2.4.2:

    lein upgrade 2.4.2

This is no longer necessary: lein 2.5 is fixed:

    user> (print 1)

Emacs packages: chicken-and-egg problem?

To get a productive Emacs-based Clojure environment, you need Emacs packages.  I had some initial trouble with it not finding the Marmalade packages I asked for.  I read some advice telling me to refresh the packages, but it wasn't exactly clear how to do it all in one go.  To complicate matters, I found myself trying out the approaches of several bloggers, leading me to need to clear my config and start over more than once.

(I tried "prelude" Emacs package, but it was rather wonky and VERY slow to start up, so gave up and went with a bunch of stuff in John's config.)

So I created "refresh_pkgs.el" containing the following:

    (require 'package)
    (add-to-list 'package-archives
             '("marmalade" . "http://marmalade-repo.org/packages/"))

When I want to clear everything and start over, I do this:

    mv .emacs.d .emacs.d.sav
    mv .emacs .emacs.sav
    emacs -nw -q -l emacs_init/refresh_pkgs.el

When the welcome screen is displayed, exit Emacs with C-x C-c.  A new ".emacs.d" directory now exists with a refreshed Marmalade.  Now create the desired "init.el" (in the ".emacs.d" dir) and enter Emacs again.  This invocation will take a lot longer because it downloads and builds the desired packages.  Once again, when the welcome screen is finally displayed, exit Emacs.  You should be good to go.  (Note: I suspect that the "emacs -nw -q -l emacs_init/refresh_pkgs.el" command will be useful without starting over if there are new packages I want to try out.)

Here's the package loading section of my "init.el":

    (require 'package)
    (add-to-list 'package-archives
                 '("marmalade" . "http://marmalade-repo.org/packages/"))
    (defvar my-packages '(better-defaults
    (dolist (p my-packages)
      (when (not (package-installed-p p))
        (package-install p)))

I won't show the rest because I'm sure I'll be making a lot of changes over the weeks to come.

emacs: melpa v.s. marmalade

Ugh.  For various reasons, I combined marmalade with melpa.  The net result was a nightmare of things not working quite right.  Apparently when both are enabled, the above package list will fetch incompatible versions of different packages.  I tried going with melpa-only, and several packages were not available.  I could have messed with it for longer, but I just wanted things to work again.  So I started from scratch again with the above settings.  With the latest lein, I'm back to everything working.