Friday, July 9, 2021

More Perl "grep" performance

In an earlier post, I discovered that a simple Perl program can outperform grep by about double. Today I discovered that some patterns can cause the execution time to balloon tremendously.

I have a new big log file, this time with about 70 million lines. I'm running it on my newly-updated Mac, whose "time" command has slightly different output.

Let's start with this:

time grep 'asdf' cetasfit05.txt
... 39.388 total

time 'asdf' cetasfit05.txt
... 21.388 total

About twice as fast.

Now let's change the pattern:

time grep 'XBT|XBM' cetasfit05.txt
... 24.787 total

time 'XBT|XBM' cetasfit05.txt
... 18.940 total

Still faster, but nowhere near twice as fast. I don't know why 

Now let's add an anchor:

time grep '^XBT|^XBM' cetasfit05.txt
... 25.580 total

time '^XBT|^XBM' cetasfit05.txt
... 3:08.25 total

WHOA! Perl, what happened????? 3 MINUTES???

My only explanation is that Perl tries to implement  a very general regular expression algorithm, and grep implements a subset, and that might cause Perl to be slow in some circumstances. For example, maybe the use of alternation with anchors introduces the need for "backtracking" under some circumstances, and maybe grep doesn't support backtracking. In this simple example, backtracking is probably not necessary, but to be general, Perl might do it "just in case". (Note: I'm not a regular expression expert, and don't really know when "backtracking" is needed; I'm speculating without bothering to learn about it.)

Anyway, let's make a small adjustment:

time '^(XBT|XBM)' cetasfit05.txt
... 17.910 total

There, that got back to "normal".

I guess multiple anchors in a pattern is a bad idea.

P.S. - even though this post is about Perl, I tried one more test with grep:

time grep 'ASDF' cetasfit05.txt
... 26.132 total

Whaaa...? I tried multiple times, and lower-case 'asdf' always takes about 40 seconds, and upper-case 'ASDF' always takes about 27 seconds. I DON'T UNDERSTAND COMPUTERS!!! (sob)

No comments: