Friday, September 12, 2014

Clisp Performance

In my previous post, I did some speed measurements of Clojure.  John commented with a test in common lisp, which performed MUCH better than Clojure.

So I did some more experimentation.  Based on John's "fact2" results, his laptop appears to be about 3x faster than mine; not surprising since I have a 2nd gen Mac Air, and he probably has a real computer :-).  But when I ran John's fact4 with clisp, I got:

(time (fact4 1000))
Real time: 0.003765 sec.
Run time: 0.003765 sec.

Whoa!  That's over 200 times slower!  I tried it again.  And again.  And again.  Same results.

Then I tried (fact4 10000) and got "APPLY: too many arguments given to +".  Oh yeah, now I remember that about clisp.  :-)

Then I tried John's trick of timing a loop of them:

(time (dotimes (i 1000 (fact4 1000))))
Real time: 0.005146 sec.
Run time: 0.005117 sec.

Each run of (fact4 1000) now took 5 microseconds. I could imagine some overhead which needs to be amortized across multiple loopings, but that seems extreme.

Look at this:

(progn
  (time (dotimes (i 1 (fact4 1000))))
  (time (dotimes (i 10 (fact4 1000))))
  (time (dotimes (i 100 (fact4 1000))))
  (time (dotimes (i 1000 (fact4 1000)))))
Real time: 0.003783 sec.
Run time: 0.003783 sec.
Space: 8760 Bytes
Real time: 0.002432 sec.
Run time: 0.002394 sec.
Space: 8760 Bytes
Real time: 0.002528 sec.
Run time: 0.002528 sec.
Space: 8760 Bytes
Real time: 0.003901 sec.
Run time: 0.003832 sec.
Space: 8760 Bytes

They ALL took approximately the same time!  Is clisp somehow not doing the loop?  Maybe it sees that there are no side-effects, and just optimizes out all but one run?  Maybe it implicitly does Clojure's "memoize" function of caching results?  I kept going:

(progn
  (time (dotimes (i 1000 (fact4 1000))))
  (time (dotimes (i 10000 (fact4 1000))))
  (time (dotimes (i 100000 (fact4 1000))))
  (time (dotimes (i 1000000 (fact4 1000)))))
Real time: 0.004611 sec.
Run time: 0.00461 sec.
Space: 8760 Bytes
Real time: 0.017613 sec.
Run time: 0.01703 sec.
Space: 8760 Bytes
Real time: 0.126134 sec.
Run time: 0.125795 sec.
Space: 8760 Bytes
Real time: 1.231531 sec.
Run time: 1.230281 sec.
Space: 8760 Bytes

FINALLY!  The times start increasing by factors of 10 after about 10,000 runs, settling down to about 1.2 microseconds per "(fact4 1000)".  John - I suspect that you would see even better performance if you did more loops.

Anyway, I went back to the Clojure programs and did the same thing with timing multiple loopings.

(do
  (time (dotimes [i 1] (fact1 1000)))
  (time (dotimes [i 10] (fact1 1000)))
  (time (dotimes [i 100] (fact1 1000)))
  (time (dotimes [i 1000] (fact1 1000))))
"Elapsed time: 0.396 msecs"
"Elapsed time: 2.413 msecs"
"Elapsed time: 39.325 msecs"
"Elapsed time: 250.707 msecs"

The timings are basically consistent with a 10-fold increase in time with each 10-fold increase in loops.  Yes, there is variability (presumably due to GC), but multiple runs have tended to prove the rule.

I get the same trends with fact2 and fact3.  So my originally-posted timings were about right.

Conclusion?  I've spent WAY too much time on this issue this morning and I'm going to be late to work.  I think I'll just skip my morning exercise and only be a little late.  :-)

(Update: my original C-language timings were wrong.  I've corrected this post based on correct C timings.  I no longer mistrust the "clisp" results, and have removed claims to that effect.)

Update 2: now that I trust the timing, I did a bit more number crunching. I'll assume that clisp can run a single call to (fact4 1000) in 1.2 microseconds. There is obviously some kind of overhead associated with the first call. (Maybe clisp is compiling "fact4" just in time?  I.e. after "time" took the start timestamp?) Let's say that it takes 3500 microseconds for that overhead. So, with varying number of loops, I would expect to see the following:

3500 + 1.2 * 1 = 3501.2
3500 + 1.2 * 10 = 3512
3500 + 1.2 * 100 = 3620
3500 + 1.2 * 1000 = 4700
3500 + 1.2 * 10000 = 15500
3500 + 1.2 * 100000 = 123500
3500 + 1.2 * 1000000 = 1203500

Given the wide variability of individual timings, this matches reasonably well the results above. I don't think clisp is doing anything too fishy.

3 comments:

John Jacobsen said...

Sorry to make you skip your exercise!

I used SBCL (in an Emacs SLIME REPL), and the times seem to scale roughly linearly with the number of iterations.

John Jacobsen said...

One more thing - did you check that your CLISP function is actually returning the correct result? If it's raising an exception, that might explain the relatively constant times.

Steve Ford said...

Heh. Regarding checking the result, no I didn't. And neither did you. :-) Since the lisp code uses "below", it never adds in the final value. I.e. (fact4 1000) doesn't add in the last 1000, and therefore returns 499500.

But that doesn't have a measurable effect on the timing. And I can just call it with 1001 to get the identical answer.

Anyway, it appears that the code is working. I'll do some more playing around over the weekend, both with clisp and I'll try out SBCL.