Thursday, September 11, 2014

Clojure performance

There is a cute site which poses coding challenges to Clojure programmers.  I don't have  lot of time each day, so I'm only up to #42, but I'm having fun.  The actual problems have been trivial so far, targeted to giving a beginner practice with Clojure features.  One nice thing about it is that you can "follow" somebody else to see what their solutions are (but you can only see them *after* you've solved it yourself).  I've found a pretty experienced Clojure programmer to follow, so it's been very educational to compare my sometimes-brute force solutions with his more elegant solutions.

For example, problem #42 is to write a factorial function.  For the purposes of this post, I'm changing the factorial algorithm to do addition rather than multiplication.  This is because I wanted to test it with large numbers without arithmetic overflow to get timing information.

As a long-time C programmer, I would just write:

long long fact (long long n)
  long long r = 1;
  long long i;
  for (i = 2; i <= n; i++)
    r += i;    /* not really factorial */
  return r;

I timed that (on my 2nd gen Mac Air) for n=1,000 and it took 6 nanoseconds.

A beginning Clojure functional programmer might do it recursively like this:

(defn fact1 [n]
  (if (= n 1)
      (+ n (fact1 (dec n)))))

For n=1,000, this takes about 330 microseconds - about 55,000 times longer than C.  :-(

The more-experienced guy solved it this way:

(defn fact2 [n]
  (apply + (range 1 (inc n))))

This is very elegant and uses some of Clojure's interesting features.  As soon as I saw it, I thought, "wow, that probably took him 10 seconds to write."  It executes in almost exactly the same time as the recursive version - about 330 micros.

Here is how I wrote it:

(defn fact3 [n]
  (loop [sum-so-far 1, i n]
    (if (<= i 1)
      (recur (+ sum-so-far i) (dec i)))))

Note that this is arguably the most-complex of all the solutions.  Part of the reason I wanted to write it this way was that I wanted to practice a method called "tail recursion".  In this method, the algorithm is technically recursive (hence the "recur" function), but the compiler can trivially convert it into a simple loop.  I.e. this method does not actually create 1000 stack frames.  The reason for doing this is to make it more efficient.

My version takes about 200 microseconds, only 33,000 times longer than C.  :-)

What happens if we go with n=10,000?  Surprise!  The purely-recursive version gets a stack overflow.  (That seems like an awfully small stack to me...)  The other versions each take about 10 times longer, as you would expected.

The moral of the story?  Use Clojure when you want short development cycles, not high performance.  And by that measure, the *real* winner is that more-experienced Clojure programmer who wrote "(apply + (range 1 (inc n)))".  No "if" statements to get wrong, no partial-sum to keep track of, no stacks to overflow.  By a HUGE margin, the fastest to get to a working program.

(Update: my first posting used mistakenly used C timing without optimizing.  I've corrected this article with timings using "-O3" optimization.  The optimizer gave a 600x speed improvement, from 4 microseconds to 6 nanoseconds.  Wow!)


John Jacobsen said...

Interesting post! Just for comparison, I tried it in Common Lisp:

(defun fact4 (n)
(apply #'+ (loop for i below n collect i)))

On my laptop, (fact4 1000) runs in about 17 µsec, whereas the (fact2 1000) (the Clojure version) takes 95 µsecs. (In each case I ran the test 1000 times and divided by 1000 to improve accuracy a bit).

I suspect that a more experienced Common Lisp programmer could do even better.

Steve Ford said...

Which common lisp implementation did you use? I normally use clisp, and there's something fishy going on. See my next post.

John Jacobsen said...

I used SBCL -- see my comments on your other post.

Steve Ford said...

Well, color me embarrassed. My C program which I measured at 4 microseconds to run fact(1000) -- I forgot to compile it with optimization. No wonder it was slower than lisp.

I recompiled it at -O3 and it ran fact(1000) in 6 nanoseconds - over 600 times faster. Memory accesses are expensive!

So, given that clisp can do one (fact4 1000) in 1.2 microseconds, I can say that clisp is 200 times slower. That sounds believable.

Clojure? Yeah, it's slow.

John Jacobsen said...

Clojure is slower than C for many things, that's for sure.

However, by changing your factorial function to use + instead of *, you are optimizing for problems in C's wheelhouse rather than Clojure's (or Common Lisp's).

If one uses a "real" factorial function, which, in Clojure, looks something like

(defn real-fact [n]
(apply *' (range 1 (inc n))))

and try to, say, count the digits of (real-fact 100000):

(->> 100000

how long does it take in C? In Clojure it takes 41 seconds; in Common Lisp, the equivalent code takes less than 4 seconds. I'm sure you can code something up in C to do this, but I'll get the answer faster this way :-).

Note also Greenspun's Tenth Rule.

Steve Ford said...

> I'm sure you can code something up in C to do this, but I'll get the answer faster this way.

Absolutely. In C I would either have to code up an arbitrary-precision library, or search for an open-source one, download it, figure out how to use it, decide if it is licensed appropriately to be used, and incorporate it into my build system. Either one would take multiple days to accomplish.

I presume it took less than 5 minutes for you to write your clojure version. :-)

Interesting degression. I timed (real-fact 100000) by itself. it took 32 seconds to do the calculation. Then I ran your digit counting program which calculates the result, converts it to a string and counts the characters. That took 104 seconds on my Air. I didn't expect the string conversion to take more than twice the time than the initial calculation, but when I think about it, it makes some sense. The factorial calculation loops 100,000 times. To convert a 456,574-digit number to a string loops THAT many times, with more arithmetic operations per loop to boot. It's a wonder it didn't take even longer.

I hadn't seen Greenspun's Tenth Rule before. :-)