Saturday, July 11, 2020

Perl Faster than Grep

So, I've been crawling through a debug log file that is 195 million lines long. I've been using a lot of "grep | wc" to count numbers of various log messages. Here's some timings for my Macbook Pro:

$ time cat dbglog.txt >/dev/null
real 0m35.423s

$ time wc dbglog.txt
195177935 1177117603 28533284864 dbglog.txt
real 1m44.560s

$ time egrep '999999' dbglog.txt
real 7m39.737s

(For this timing, I chose a pattern that would *NOT* be found.)

On the Macbook, the man page for fgrep claims that it is faster than grep. Let's see:

$ time fgrep '999999' dbglog.txt
real 7m11.365s

Well, I guess it's a little faster, but nothing to brag about.

Then I wanted to create a histogram of some findings, so I wrote a perl script to scan the file and create the histogram. Since it performed regular expression matching on every line, I assumed it would be a little slower than grep, since Perl is an interpreted language.

$ time ./count.pl dbglog.txt >count.out
real 3m9.427s

WOW! Less than half the time!

So I created a simple grep replacement: grep.pl. It doesn't do any histogramming, so it should be even faster.

$ time grep.pl '999999' dbglog.txt
real  2m8.341s

Amazing. Perl grep runs in less than a third the time of grep.

For small files, I bet Perl grep is slower starting up. Let's see.

$ time echo "hi" | grep 9999
real        0m0.051s

$ time echo "hi" | grep.pl 9999
real        0m0.113s

Yep. Grep saves you about 60 milliseconds. So if you had thousands of small files to grep, it might be faster to use grep.



UPDATE:

I got another big log file today (70 million lines) and saw something pretty surprising given my initial findings.


6 comments:

Sahir said...

If you like the idea of grepping with Perl, check out Ack, which is a full featured grep replacement written in Perl.

If you are interested in a faster grep though, ripgrep (rg) is hard to beat! Due to many various tweaks including multithreading, a regex engine that can use SIMD instructions, and built from the ground up unicode support, rg ends up being faster than grep and ack in most cases.

Steve Ford said...

Thanks! I benefitted already from looking at Ack; I didn't know about the File::Next utility, which will easily give me recursive input files. It doesn't integrate with the diamond operator, which is a pity, but will probably be worth it.

BTW, are you the Sahir I've worked with?

Sahir said...

> BTW, are you the Sahir I've worked with?

The very same!

John Jacobsen said...

This is indeed a slightly surprising result.

I tried a similar experiment with Clojure and found that even non-optimized Clojure code was almost as fast as grep.

(defn lazy-file-lines
"
See https://stackoverflow.com/questions/4118123/read-a-very-large-text-file-into-a-list-in-clojure
"
[file]
(letfn [(helper [rdr]
(lazy-seq
(if-let [line (.readLine rdr)]
(cons line (helper rdr))
(do (.close rdr) nil))))]
(helper (clojure.java.io/reader file))))

(time (->> "/tmp/big.txt"
lazy-file-lines
(take 1000000)
(filter #(.contains % "66666666"))
doall))
;;=>
"Elapsed time: 10487.314185 msecs"

versus

time head -1000000 /tmp/big.txt | grep 66666666

real 0m7.765s
user 0m7.936s
sys 0m0.333s

(I tried formatting the code prettily but Blogger doesn't like the "" tag.)

This surprises me because grep is presumably written in C and should be "closer to the metal."

If I have more time sometime I might try the same comparison with Common Lisp, which blazes, compared to just about anything except C (see my latest blog post on johnj.com).

Fun post!

Steve Ford said...

> This surprises me because grep is presumably written in C and should be "closer to the metal."

Actually, I'm surprised that it didn't surpass grep, the same way Perl surpassed it. Regular expressions are generally compiled down to a simple FSM, so I think the difference in speed between grep and Perl is how efficient the FSM interpreter is. Perl's maintainers have put a LOT of effort into tuning the FSM interpreter; not so much grep.

IIRC, Clojure doesn't have a built-in Regex compiler/FSM interpreter. Clojure uses the JVM's Regex APIs. I wonder if Java grep would be slower than grep. Maybe Clojure recompiles the pattern for each input line? I doubt it. Or maybe Java hasn't put as much effort into regex efficiency as the Perl maintainers? Plausible. I'm thinking maybe the FSM is efficient enough that inter-line overhead is holding Clojure back.

Are the lines in "big.txt" longish or short? If short, long lines might turn the tables since more time is spent per line inside the FSM.

Regarding Common Lisp, does it have a Regex engine built in? (HA! Grammarly wants "built in" to be hyphenated, but it's not being used as a noun there, so it's wrong!)

John Jacobsen said...

As far as I know, CL-PPCRE is the best option for Common Lisp, but I don't know its performance characteristics -- I may give that a whack sometime soon!

(JVM) Clojure just uses Java for regexes, it is true, though I am not using that under the hood, just the String contains() method, which should be even faster.

The thing that's super fun about doing this sort of test in Common Lisp is that you can disassemble your code and study/tune it rather close to the metal. See, for example, http://johnj.com/from-elegance-to-speed.html

Cheers!
John