In an earlier post, I discovered that a simple Perl program can outperform grep by about double. Today I discovered that some patterns can cause the execution time to balloon tremendously.
I have a new big log file, this time with about 70 million lines. I'm running it on my newly-updated Mac, whose "time" command has slightly different output.
Let's start with this:
time grep 'asdf' cetasfit05.txt
... 39.388 total
time grep.pl 'asdf' cetasfit05.txt
... 21.388 total
About twice as fast.
Now let's change the pattern:
time grep 'XBT|XBM' cetasfit05.txt
... 24.787 total
time grep.pl 'XBT|XBM' cetasfit05.txt
... 18.940 total
Still faster, but nowhere near twice as fast. I don't know why
Now let's add an anchor:
time grep '^XBT|^XBM' cetasfit05.txt
... 25.580 total
time grep.pl '^XBT|^XBM' cetasfit05.txt
... 3:08.25 total
WHOA! Perl, what happened????? 3 MINUTES???
My only explanation is that Perl tries to implement a very general regular expression algorithm, and grep implements a subset, and that might cause Perl to be slow in some circumstances. For example, maybe the use of alternation with anchors introduces the need for "backtracking" under some circumstances, and maybe grep doesn't support backtracking. In this simple example, backtracking is probably not necessary, but to be general, Perl might do it "just in case". (Note: I'm not a regular expression expert, and don't really know when "backtracking" is needed; I'm speculating without bothering to learn about it.)
Anyway, let's make a small adjustment:
time grep.pl '^(XBT|XBM)' cetasfit05.txt
... 17.910 total
There, that got back to "normal".
I guess multiple anchors in a pattern is a bad idea.
P.S. - even though this post is about Perl, I tried one more test with grep:
time grep 'ASDF' cetasfit05.txt
... 26.132 total
Whaaa...? I tried multiple times, and lower-case 'asdf' always takes about 40 seconds, and upper-case 'ASDF' always takes about 27 seconds. I DON'T UNDERSTAND COMPUTERS!!! (sob)