Efficient duplication searching in UNIX and Perl

I’ve been meaning to post this for a few days, but haven’t had the chance, given a lack of time and more “pressing” blog posts. Anyway, recently at work I had two text files, A and B. Both were lists, and I was tasked with seeing whether any of the items in file A also appeared in file B. The lists were very long: 25,000 lines and 12,000 lines respectively, so reading through them by hand was out of the question; I had to do it with UNIX tools.

So, my first attempt was to try to use a shell script to do this:

for i in `cat A`; do
    echo -n "$i: " && grep -c "$i" B
done

This would tell me how many times each item in “A” appeared in “B” (hopefully 0 times). I ran this on one of our systems, and in about an hour we got an alarm saying that the system’s CPU was at 100%. Oops. This was not an efficient search.

Someone suggested using “grep” in a more streamlined manner:

grep -f A B
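
(A variant I didn’t try at the time: adding -F, which treats each line of A as a fixed string rather than a regular expression, and -x, which only counts whole-line matches, might well have fared better:)

grep -F -x -f A B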

The plain grep -f run gave me the same result, though. My associate, Thrig, suggested using Perl instead, so with some help from him I wrote this script:

#!/usr/bin/perl

use warnings;
use strict;

my $fh; # Filehandle for file A
my $ofh; # The "other" filehandle, for file B
my %list;

# Open file A, load into a hash with values of "1"
open ($fh, "<", "$ARGV[0]") or die "Cannot open $ARGV[0]: $!\n";

while (<$fh>) {
    chomp();
    $list{$_} = 1;
}
close ($fh);

# Open second file, and loop through filehandle,
# check to see if item already exists in hash
open ($ofh, "<", "$ARGV[1]") or die "Cannot open $ARGV[1]: $!\n";

while (<$ofh>) {
    chomp();
    if ($list{$_}) {
        print "$_ is a duplicate.\n";
    }
}
close ($ofh);

This script loads the first file, A, into a hash whose values are all set to a default of “1”. A hash is a very efficient data structure for lookups, though it uses a fair amount of memory for storage; that was not an issue on the host in question. The second file, B, is then opened, and each line is checked against the hash built from A. The earlier shell script and the grep command both took upwards of an hour and only got through the first 5,000 lines. The shell loop forked a fresh grep process for every line of A, consuming more resources with each iteration, hence the high CPU alarm, while the single grep -f invocation had to try all 25,000 patterns against every line of B.

However, the Perl method loads file A into memory once and then does a quick hash lookup for each line of B, all within a single process. Instead of hours, this method took 5 minutes to complete!
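
(For what it’s worth, another approach that should work well here is the standard comm utility, which, given two sorted files, can print just the lines they have in common, something like:)

sort A > A.sorted
sort B > B.sorted
comm -12 A.sorted B.sorted    # print only lines that appear in both sorted files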

It’s amazing what the right algorithm (and tools) can do to help speed up work.

As our boss at work used to say: Work smarter, not harder!


3 Responses to Efficient duplication searching in UNIX and Perl

  1. thrig says:

    Assuming no duplicates in A, the script can be optimized to:

    perl -nle 'print if $s{$_}++' fileA fileB
    :)

  2. Gerald Ford says:

    Ha ha ha, I hadn’t thought of that. Why would that work, since you’re not explicitly opening either file?

    You and your Perl one-liners. ;)

  3. thrig says:

    The -n option causes Perl to loop over input, in this case the two files listed. There’s actually another problem: duplicate lines in the second file will be printed. This is why your longer solution is more robust, as it clearly splits the hash building from the hash matching, while mine slaps them together.
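
    Expanded, the one-liner behaves roughly like this (the -l switch handles the chomping and the trailing newline):

    while (<>) {
        chomp;                      # -l strips the newline from each input line
        print "$_\n" if $s{$_}++;   # -l also makes print add the newline back
    }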
