Efficient duplication searching in UNIX and Perl

I wanted to post this for a few days, but haven’t had the opportunity, given a lack of time and more “pressing” blog posts. Anyway, recently at work I had a task involving two text files, A and B. Both were lists, and I was asked to see whether any of the items in file A were also in file B. The lists were very long (25,000 lines and 12,000 lines respectively), so I couldn’t just read through them by eye; I had to do it with UNIX tools.

So, my first attempt was to try to use a shell script to do this:

for i in `cat A`; do
    echo -n "$i: " && grep -c "$i" B
done

This would tell me how many times each item in “A” appeared in “B” (hopefully 0). I ran this on one of our systems, and in about an hour we got an alarm saying that the system’s CPU was at 100%. Oops. This was not an efficient search.

Someone suggested using “grep” in a more streamlined manner:

grep -f A B

But I got the same result. My associate, Thrig, suggested using Perl instead, so with some help from him, I wrote this script:


use warnings;
use strict;

my $fh;  # Filehandle for file A
my $ofh; # The "other" filehandle, for file B
my %list;

# Open file A, load into a hash with values of "1"
open ($fh, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";

while (<$fh>) {
    chomp; # strip the trailing newline so lines compare cleanly
    $list{$_} = 1;
}
close ($fh);

# Open second file, and loop through filehandle,
# check to see if item already exists in hash
open ($ofh, "<", $ARGV[1]) or die "Cannot open $ARGV[1]: $!\n";

while (<$ofh>) {
    chomp;
    if ($list{$_}) {
        print "$_ is a duplicate.\n";
    }
}
close ($ofh);

This script loads the first file, A, into a hash whose values are set to a default of “1”. A hash is a very efficient data structure for lookups, though it uses a fair amount of memory for storage; that was not an issue on the host in question. The second file, B, is then opened, and each line is searched against the hash built from A. The earlier shell script and “grep” command both took upwards of an hour and only got through the first 5,000 lines: each iteration of the shell loop forked a new grep process, consuming more resources on the system, hence the high-CPU alarm.

However, the Perl method just loads the whole file into memory and searches through it as a single process. Instead of hours, this method took 5 minutes to complete!

It’s amazing what the right algorithm (and tools) can do to help speed up work.

As our boss at work used to say: Work smarter, not harder!

About Doug

A fellow who dwells upon the Pale Blue Dot who spends his days obsessing over things like Buddhism, KPop music, foreign languages, BSD UNIX and science fiction.

3 thoughts on “Efficient duplication searching in UNIX and Perl”

  1. Ha ha ha, I hadn’t thought of that. Why would that work, since you’re not explicitly opening either file?

    You and your Perl one-liners. ;)

  2. The -n option causes Perl to loop over input, in this case the two files listed. There’s actually another problem: duplicate lines in the second file will be printed. This is why your longer solution is more robust, as it clearly splits the hash building from the hash matching, while mine slaps them together.
