Sunday, July 12, 2020

Perl Diamond Operator

As my previous post indicates, I've done some Perl noodling this past week. (I can't believe that was my first Perl post on this blog! I've been a Perl fan for a loooooooong time.)

Anyway, one thing I like about Perl is that it takes a common use case and adds language support for it. Case in point: the diamond operator "<>" (also called "null filehandle" or "null angle operator").

See the "Tutorial" section below if you are not familiar with the diamond operator.

Tips

I may expand this as time goes on.

Filename / Linenumber

Inside the loop, you can use "$." as the line number and "$ARGV" as the file name of the currently open file.

*BUT*, see next tip.

Continue / Close

Always code your loop as follows:

while (<>) {
  ...
} continue {
  close ARGV if eof;
}

The continue clause is needed to have "$." refer to the line number within the *current* file. Without it, "$." keeps counting across files and gives the total number of lines read so far.

In my opinion, even if what you want is total lines and not line within file, you should still code it like the above and just use your own counter for the total line number. This provides consistency of meaning for "$.". Plus, it's possible that in the future you will want to add functionality that requires line within file, and it's messy to code that with your own counter.
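
Here's a minimal sketch of that pattern. It prints each line as "file:line:text" and keeps a separate counter for the total. The sub name number_lines and the output format are my own choices for illustration. The "local @ARGV" line is there only so the loop can live in a sub and be called with any file list.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Print each input line as "file:line:text" and return the total line count.
# number_lines() is a hypothetical helper name, not part of Perl.
sub number_lines {
    my ($out, @files) = @_;
    my $total = 0;               # our own counter for lines across all files
    local @ARGV = @files;        # "<>" reads the files listed in @ARGV
    while (<>) {
        $total++;
        print {$out} $ARGV, ":", $., ":", $_;
    } continue {
        close ARGV if eof;       # resets "$." at the end of each file
    }
    return $total;
}

number_lines(\*STDOUT, @ARGV);
```

Run it with two or more files and the line numbers should restart at 1 for each file, while the returned total keeps climbing.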

Skip Rest of File

Sometimes you get a little ways into a file and you decide that you're done with the file and would like to skip to the next (if any). Include this inside the loop:

close ARGV;  # skip rest of current file
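
As a sketch of where this is handy, here's a rough "head" that prints at most a fixed number of lines from each file and then moves on. The sub name head_files and its parameters are hypothetical, made up for this example. It relies on the continue clause from the previous tip so that "$." counts lines within the current file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Copy at most $max lines from each file to $out, then skip to the next file.
# head_files() is a hypothetical helper name used only for this sketch.
sub head_files {
    my ($out, $max, @files) = @_;
    local @ARGV = @files;            # "<>" reads the files listed in @ARGV
    while (<>) {
        print {$out} $_;
        close ARGV if $. >= $max;    # skip rest of current file
    } continue {
        close ARGV if eof;           # keeps "$." counting per file
    }
    return;
}

head_files(\*STDOUT, 2, @ARGV);
```

After the explicit close, the next "<>" simply opens the next file in @ARGV (if any).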

Positional Parameter

Let's say you're writing a Perl version of grep, and you want the first positional parameter (after the options) to be the search pattern.

$ grep.pl "ford" *.txt

Unfortunately, this will try to read a file named "ford" as the first file. What to do?

my $pat = shift;  # Removes and returns $ARGV[0].
while (<>) {
  ...
}

This works because "<>" doesn't actually look at the command line. It looks at the @ARGV array. The "shift" function defaults to operating on the @ARGV array.
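
Putting it together, here's a minimal sketch of such a grep. This is not my actual grep.pl; the sub name grep_files is hypothetical and there is no option handling. Note that at file scope "shift" works on @ARGV, but inside a sub it works on @_, which is why the pattern is shifted off before the sub is called.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Print every line matching $pat from the given files; return the match count.
# grep_files() is a hypothetical helper name used only for this sketch.
sub grep_files {
    my ($out, $pat, @files) = @_;
    my $re = qr/$pat/;           # compile the pattern once
    my $matches = 0;
    local @ARGV = @files;        # "<>" looks at @ARGV, not the command line
    while (<>) {
        next unless /$re/;
        $matches++;
        print {$out} $_;
    } continue {
        close ARGV if eof;
    }
    return $matches;
}

my $pat = shift;   # at file scope, "shift" removes and returns $ARGV[0]
if (defined $pat) {
    grep_files(\*STDOUT, $pat, @ARGV);
} else {
    warn "usage: grep.pl pattern [file ...]\n";
}
```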

Security Warning

Because of the way the diamond operator opens files, it is possible for a hostile user to create a file whose name alone can produce very bad results. For example:

$ echo "hello world" >x
$ echo "goodby world" >'rm x|'
$ ls -1
rm x|
x
$ cat *
goodby world
hello world
$ cat x
hello world

So far, so good. "rm x|" is just an unusually-named file with a space in the middle and a pipe ("|") at the end. But now let's use my Perl version of grep with a pattern of "." (matches all non-empty lines):

$ grep.pl "." *
Can't open x: No such file or directory at /home/sford/bin/grep.pl line 81.
$ cat x
cat: x: No such file or directory

Yup, grep.pl just deleted the file named "x". The diamond operator opens each name with the 2-argument open, which treats a trailing pipe character as "run this command and read its output". So the file name "rm x|" was executed as the command "rm x". In other words, by just naming a file in a particular way, you've made grep.pl do something unexpected and potentially dangerous.

This might look like a horrible security hole (what if the name of that rogue file resulted in deleting all your files?), but it can also be a very powerful (albeit rarely used) feature. The moral of the story is don't run *any* tool over a set of files that you aren't familiar with.

You can also use "<<>>" instead of "<>". This forces each input file to be opened as a file, never as a command. But it requires Perl version 5.22 or newer, which rarely seems to be on any system I try to use.

Unfortunately, it also prevents the special handling of an input file named "-" to read standard input. This is a construct that I do use periodically.
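
If you're stuck on an older Perl, or want to keep "-", one possible workaround is to give up the diamond operator and loop over the file list yourself with the 3-argument open. The following is only a sketch under that assumption; the sub name cat_files is hypothetical, and you lose "$ARGV" and the automatic file switching.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Copy each named file to $out, opening names strictly as files.
# cat_files() is a hypothetical helper name used only for this sketch.
sub cat_files {
    my ($out, @files) = @_;
    @files = ('-') unless @files;      # no files given: read standard input
    for my $file (@files) {
        my $fh;
        if ($file eq '-') {
            $fh = \*STDIN;             # keep the "-" convention by hand
        } else {
            # The 3-argument open never interprets the name as a command,
            # so a file called "rm x|" is just read like any other file.
            unless (open($fh, '<', $file)) {
                warn "Can't open $file: $!\n";
                next;
            }
        }
        while (my $line = <$fh>) {
            print {$out} $line;
        }
        close $fh unless $file eq '-';
    }
    return;
}

cat_files(\*STDOUT, @ARGV);
```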

Tutorial

Many Unix commands have the following semantics:

cmdname -options [input_file [input_file2 ...] ]

where the command will read from each input file sequentially, or from standard input if no input files are provided. File names can be wildcarded. Most such Unix commands allow you to supply "-" as a file name and the tool will read from standard input.

The diamond operator makes this ridiculously easy. Here's a minimal "cat" command in Perl:

#!/usr/bin/env perl
while (<>) {
  print $_;
}


That's the whole thing. It takes zero or more input files (if none, reads from standard input) and concatenates them to standard output. Just like "cat".

Specifically, "<>" reads one line from whatever input file is currently open. If it is at the end of the file, "<>" will automatically open the next file (if any) and read a line from it. As with many Perl built-ins, it leaves the just-read line in the "$_" variable.

You should be ready for the "Tips" section now.

Saturday, July 11, 2020

Perl Faster than Grep

So, I've been crawling through a debug log file that is 195 million lines long. I've been using a lot of "grep | wc" to count numbers of various log messages. Here are some timings for my Macbook Pro:

$ time cat dbglog.txt >/dev/null
real 0m35.423s

$ time wc dbglog.txt
195177935 1177117603 28533284864 dbglog.txt
real 1m44.560s

$ time egrep '999999' dbglog.txt
real 7m39.737s

(For this timing, I chose a pattern that would *NOT* be found.)

On the Macbook, the man page for fgrep claims that it is faster than grep. Let's see:

$ time fgrep '999999' dbglog.txt
real 7m11.365s

Well, I guess it's a little faster, but nothing to brag about.

Then I wanted to create a histogram of some findings, so I wrote a Perl script to scan the file and create the histogram. Since it performed regular expression matching on every line, I assumed it would be a little slower than grep, since Perl is an interpreted language.

$ time ./count.pl dbglog.txt >count.out
real 3m9.427s

WOW! Less than half the time!

So I created a simple grep replacement: grep.pl. It doesn't do any histogramming, so it should be even faster.

$ time grep.pl '999999' dbglog.txt
real  2m8.341s

Amazing. Perl grep runs in less than a third the time of grep.

For small files, I bet Perl grep is slower starting up. Let's see.

$ time echo "hi" | grep 9999
real        0m0.051s

$ time echo "hi" | grep.pl 9999
real        0m0.113s

Yep. Grep saves you about 60 milliseconds. So if you had thousands of small files to grep, it might be faster to use grep.



UPDATE:

I got another big log file today (70 million lines) and saw something pretty surprising given my initial findings.