Find files that contain multiple keywords anywhere in the file

I'm looking for a way to list all files in a directory that contain the full set of keywords I'm seeking, anywhere in the file.

So the keywords need not appear on the same line.

One way to do this would be:

grep -l one $(grep -l two $(grep -l three *)) 

Three keywords is just an example, it could just as well be two, or four, and so on.

A second way I can think of is:

grep -l one * | xargs grep -l two | xargs grep -l three 

A third method, that appeared in another question, would be:

find . -type f \
  -exec grep -q one {} \; -a \
  -exec grep -q two {} \; -a \
  -exec grep -q three {} \; -a -print

But that's definitely not the direction I'm going here. I want something that requires less typing, and possibly just one call to grep, awk, perl or similar.

For example, I like how awk lets you match lines that contain all keywords, like:

awk '/one/ && /two/ && /three/' * 

Or, print just the file names:

awk '/one/ && /two/ && /three/ { print FILENAME ; nextfile }' * 

But I want to find files where the keywords may be anywhere in the file, not necessarily on the same line.


Preferred solutions would be gzip-friendly; for example, grep has the zgrep variant that works on compressed files. I mention this because some solutions may not work well under this constraint. For example, in the awk example that prints matching file names, you can't just do:

zcat * | awk '/pattern/ {print FILENAME; nextfile}' 

You need to significantly change the command, to something like:

for f in *; do zcat $f | awk -v F=$f '/pattern/ { print F; nextfile }'; done 

So, because of the constraint, you need to call awk many times, even though with uncompressed files you could do it just once. It would certainly be nicer to just run zawk '/pattern/ {print FILENAME; nextfile}' * and get the same effect, so I would prefer solutions that allow this.
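
As a rough illustration of what such a wrapper could look like, here is a minimal sketch in bash (the zawk name and helper are hypothetical; it still forks awk once per file and exposes the file name as F rather than FILENAME, so it only approximates the ideal single-invocation behaviour):

# hypothetical zawk: run an awk program over plain or gzipped files,
# decompressing with zcat -f and passing the real file name in F
zawk() {
  local prog=$1; shift
  local f
  for f in "$@"; do
    zcat -f "$f" | awk -v F="$f" "$prog"
  done
}

# usage sketch (note F instead of FILENAME, exit instead of nextfile):
# zawk '/pattern/ { print F; exit }' *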

You don't need them to be gzip friendly, just zcat the files first. – terdon (yesterday)

@terdon I've edited the post, explaining why I mention that files are compressed. – arekolek (yesterday)

There isn't much difference between launching awk once or many times. I mean, OK, some small overhead but I doubt you would even notice the difference. It is, of course, possible to make the awk/perl whatever script do this itself but that starts to become a full blown program and not a quick one-liner. Is that what you want? – terdon (yesterday)

@terdon Personally, the more important aspect for me is how complicated the command will be (I guess my second edit came while you were commenting). For example, the grep solutions are easily adaptable just by prefixing grep calls with a z, there's no need for me to also handle file names. – arekolek (yesterday)

Yes, but that's grep. AFAIK, only grep and cat have standard "z-variants". I don't think you'll get anything simpler than using a for f in *; do zcat -f $f ... solution. Anything else would have to be a full program that checks file formats before opening or uses a library to do the same. – terdon (yesterday)

awk 'FNR == 1 { f1=f2=f3=0; }; /one/ { f1++ }; /two/ { f2++ }; /three/ { f3++ }; f1 && f2 && f3 { print FILENAME; nextfile; }' * 

If you want to automatically handle gzipped files, either run this in a loop with zcat (slow and inefficient because you'll be forking awk many times, once for each filename), or rewrite the same algorithm in perl and use the IO::Uncompress::AnyUncompress library module, which can decompress several different kinds of compressed files (gzip, zip, bzip2, lzop), or in python, which also has modules for handling compressed files.
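
A minimal sketch of that zcat loop, for reference (each file gets its own awk process; F stands in for FILENAME because awk reads from a pipe):

for f in *; do
  zcat -f "$f" | awk -v F="$f" '
    /one/   { f1++ }
    /two/   { f2++ }
    /three/ { f3++ }
    f1 && f2 && f3 { print F; exit }'
done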


Here's a perl version that uses IO::Uncompress::AnyUncompress to allow for any number of patterns and any number of filenames (containing either plain text or compressed text).

All args before -- are treated as search patterns. All args after -- are treated as filenames. Primitive but effective option handling for this job. Better option handling (e.g. to support a -i option for case-insensitive searches) could be achieved with the Getopt::Std or Getopt::Long modules.

Run it like so:

$ ./arekolek.pl one two three -- *.gz *.txt
1.txt.gz
4.txt.gz
5.txt.gz
1.txt
4.txt
5.txt

(I won't list files {1..6}.txt.gz and {1..6}.txt here...they just contain some or all of the words "one" "two" "three" "four" "five" and "six" for testing. The files listed in the output above DO contain all three of the search patterns. Test it yourself with your own data)

#! /usr/bin/perl

use strict;
use warnings;
use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError) ;

my %patterns=();
my @filenames=();
my $fileargs=0;

# all args before '--' are search patterns, all args after '--' are
# filenames
foreach (@ARGV) {
  if ($_ eq '--') { $fileargs++ ; next };

  if ($fileargs) {
    push @filenames, $_;
  } else {
    $patterns{$_}=1;
  };
};

my $pattern=join('|',keys %patterns);
$pattern=qr($pattern);

my $p_string=join('',sort keys %patterns);

foreach my $f (@filenames) {
  #my $lc=0;
  my %s = ();
  my $z = new IO::Uncompress::AnyUncompress($f)
    or die "IO::Uncompress::AnyUncompress failed: $AnyUncompressError\n";

  while ($_ = $z->getline) {
    #last if ($lc++ > 100);
    my @matches=( m/($pattern)/og);
    next unless (@matches);

    map { $s{$_}=1 } @matches;
    my $m_string=join('',sort keys %s);

    if ($m_string eq $p_string) {
      print "$f\n" ;
      last;
    }
  }
}

A hash %patterns contains the complete set of patterns, and a file has to contain at least one match for each member. $p_string is a string containing the sorted keys of that hash. The string $pattern contains a pre-compiled regular expression also built from the %patterns hash.

$pattern is compared against each line of each input file (using the /o modifier to compile $pattern only once as we know it won't ever change during the run), and map() is used to build a hash (%s) containing the matches for each file.

Whenever all the patterns have been seen in the current file (i.e. when $m_string, the sorted keys of %s, is equal to $p_string), the script prints the filename and skips to the next file.

This is not a particularly fast solution, but is not unreasonably slow. The first version took 4m58s to search for three words in 74MB worth of compressed log files (totalling 937MB uncompressed). This current version takes 1m13s. There are probably further optimisations that could be made.

One obvious optimisation is to use this in conjunction with xargs's -P aka --max-procs to run multiple searches on subsets of the files in parallel. To do that, you need to count the number of files and divide by the number of cores/cpus/threads your system has (and round up by adding 1). e.g. there were 269 files being searched in my sample set, and my system has 6 cores (an AMD 1090T), so:

patterns=(one two three)
searchpath='/var/log/apache2/'
cores=6

filecount=$(find "$searchpath" -type f -name 'access.*' | wc -l)
filespercore=$((filecount / cores + 1))

find "$searchpath" -type f -print0 |
  xargs -0r -n "$filespercore" -P "$cores" ./arekolek.pl "${patterns[@]}" --

With that optimisation, it took only 23 seconds to find all 18 matching files. Of course, the same could be done with any of the other solutions. NOTE: the order of filenames listed in the output will be different, so the output may need to be sorted afterwards if that matters.

As noted by @arekolek, multiple zgreps with find -exec or xargs can do it significantly faster, but this script has the advantage of supporting any number of patterns to search for, and is capable of dealing with several different types of compression.
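
For comparison, that zgrep chain is essentially the question's xargs variant with a z prefix; a sketch for gzipped files (it handles only gzip and a fixed set of patterns):

zgrep -l one *.gz | xargs zgrep -l two | xargs zgrep -l three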

If the script is limited to examining only the first 100 lines of each file, it runs through all of them (in my 74MB sample of 269 files) in 0.6 seconds. If this is useful in some cases, it could be made into a command line option (e.g. -l 100) but it has the risk of not finding all matching files.


BTW, according to its man page, the compression formats supported by IO::Uncompress::AnyUncompress include gzip, zip, bzip2 and lzop, among others.

One last (I hope) optimisation. By using the PerlIO::gzip module (packaged in debian as libperlio-gzip-perl) instead of IO::Uncompress::AnyUncompress I got the time down to about 3.1 seconds for processing my 74MB of log files. There were also some small improvements by using a simple hash rather than Set::Scalar (which also saved a few seconds with the IO::Uncompress::AnyUncompress version).

PerlIO::gzip was recommended as the fastest perl gunzip in http://stackoverflow.com/a/1539271/137158 (found with a google search for perl fast gzip decompress)

Using xargs -P with this didn't improve it at all. In fact it even seemed to slow it down by anywhere from 0.1 to 0.7 seconds. (I tried four runs and my system does other stuff in the background which will alter the timing)

The price is that this version of the script can only handle gzipped and uncompressed files. Speed vs flexibility: 3.1 seconds for this version vs 23 seconds for the IO::Uncompress::AnyUncompress version with an xargs -P wrapper (or 1m13s without xargs -P).

#! /usr/bin/perl

use strict;
use warnings;
use PerlIO::gzip;

my %patterns=();
my @filenames=();
my $fileargs=0;

# all args before '--' are search patterns, all args after '--' are
# filenames
foreach (@ARGV) {
  if ($_ eq '--') { $fileargs++ ; next };

  if ($fileargs) {
    push @filenames, $_;
  } else {
    $patterns{$_}=1;
  };
};

my $pattern=join('|',keys %patterns);
$pattern=qr($pattern);

my $p_string=join('',sort keys %patterns);

foreach my $f (@filenames) {
  open(F, "<:gzip(autopop)", $f) or die "couldn't open $f: $!\n";
  #my $lc=0;
  my %s = ();
  while (<F>) {
    #last if ($lc++ > 100);
    my @matches=(m/($pattern)/ogi);
    next unless (@matches);

    map { $s{$_}=1 } @matches;
    my $m_string=join('',sort keys %s);

    if ($m_string eq $p_string) {
      print "$f\n" ;
      close(F);
      last;
    }
  }
}
for f in *; do zcat $f | awk -v F=$f '/one/ {a++}; /two/ {b++}; /three/ {c++}; a&&b&&c { print F; nextfile }'; done works fine, but indeed, takes 3 times as long as my grep solution, and is actually more complicated. – arekolek (yesterday)

OTOH, for plain text files it would be faster. And the same algorithm implemented in a language with support for reading compressed files (like perl or python), as I suggested, would be faster than multiple greps. "Complication" is partially subjective; personally, I think a single awk or perl or python script is less complicated than multiple greps with or without find. @terdon's answer is good, and does it without needing the module I mentioned (but at the cost of forking zcat for every compressed file). – cas (yesterday)

I had to apt-get install libset-scalar-perl to use the script. But it doesn't seem to terminate in any reasonable time. – arekolek (23 hours ago)

How many and what size (compressed and uncompressed) are the files you're searching? Dozens or hundreds of small-medium size files or thousands of large ones? – cas (20 hours ago)

Here's a histogram of the sizes of compressed files (20 to 100 files, up to 50MB but mostly below 5MB). The uncompressed files look the same, but with sizes multiplied by 10. – arekolek (19 hours ago)


Set the record separator to . so that awk will treat the whole file as one line:

awk -v RS='.' '/one/&&/two/&&/three/{print FILENAME}' * 

Similarly with perl:

perl -ln00e '/one/&&/two/&&/three/ && print $ARGV' * 
Neat. Note that this will load the whole file into memory though and that might be a problem for large files. – terdon (yesterday)

I initially upvoted this, because it looked promising. But I can't get it to work with gzipped files. for f in *; do zcat $f | awk -v RS='.' -v F=$f '/one/ && /two/ && /three/ { print F }'; done outputs nothing. – arekolek (yesterday)

@arekolek That loop works for me. Are your files properly gzipped? – jimmij (yesterday)

@arekolek you need zcat -f "$f" if some of the files are not compressed. – terdon (yesterday)

I've tested it also on uncompressed files and awk -v RS='.' '/bfs/&&/none/&&/rgg/{print FILENAME}' greptest/*.txt still returns no results, while grep -l rgg $(grep -l none $(grep -l bfs greptest/*.txt)) returns expected results. – arekolek (22 hours ago)

For compressed files, you could loop over each file and decompress first. Then, with a slightly modified version of the other answers, you can do:

for f in *; do
  zcat -f "$f" |
    perl -ln00e '/one/&&/two/&&/three/ && exit(0); }{ exit(1)' &&
    printf '%s\n' "$f"
done

The Perl script will exit with 0 status (success) if all three strings were found. The }{ is Perl shorthand for END{}. Anything following it will be executed after all input has been processed. So the script will exit with a non-0 exit status if not all the strings were found. Therefore, the && printf '%s\n' "$f" will print the file name only if all three were found.

Or, to avoid loading the file into memory:

for f in *; do
  zcat -f "$f" 2>/dev/null |
    perl -lne '$k++ if /one/; $l++ if /two/; $m++ if /three/; exit(0) if $k && $l && $m; }{ exit(1)' &&
    printf '%s\n' "$f"
done

Finally, if you really want to do the whole thing in a script, you could do:

#!/usr/bin/env perl

use strict;
use warnings;

## Get the target strings and file names. The first three
## arguments are assumed to be the strings, the rest are
## taken as target files.
my ($str1, $str2, $str3, @files) = @ARGV;

FILE: foreach my $file (@files) {
    my $fh;
    my ($k,$l,$m)=(0,0,0);
    ## only process regular files
    next unless -f $file ;
    ## Open the file in the right mode
    $file=~/\.gz$/ ? open($fh,"-|", "zcat $file") : open($fh, $file);
    ## Read through each line
    while (<$fh>) {
        $k++ if /$str1/;
        $l++ if /$str2/;
        $m++ if /$str3/;
        ## If all 3 have been found
        if ($k && $l && $m){
            ## Print the file name
            print "$file\n";
            ## Move to the next file
            next FILE;
        }
    }
    close($fh);
}

Save the script above as foo.pl somewhere in your $PATH, make it executable and run it like this:

foo.pl one two three * 

Of all the solutions proposed so far, my original solution using grep is the fastest, finishing in 25 seconds. Its drawback is that it's tedious to add and remove keywords. So I came up with a script (dubbed multi) that simulates the behavior but lets me change the syntax:

#!/bin/bash

# Usage: multi [z]grep PATTERNS -- FILES

command=$1

# first two arguments constitute the first command
command_head="$1 -l $2"
shift 2

# arguments before double-dash are keywords to be piped with xargs
while (("$#")) && [ $1 != -- ] ; do
  command_tail+="| xargs $command -l $1 "
  shift
done
shift

# remaining arguments are files
eval "$command_head $@ $command_tail"

So now, writing multi grep one two three -- * is equivalent to my original proposal and runs in the same time. I can also easily use it on compressed files by passing zgrep as the first argument instead.
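
For example, a run over compressed files would look like this (assuming the compressed files match *.gz):

multi zgrep one two three -- *.gz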

Other solutions

I also experimented with a Python script using two strategies: searching for all keywords line by line, and searching the whole file keyword by keyword. The second strategy was faster in my case, but it was still slower than plain grep, finishing in 33 seconds. Line-by-line keyword matching finished in 60 seconds.

#!/usr/bin/python3

import gzip, sys

i = sys.argv.index('--')
patterns = sys.argv[1:i]
files = sys.argv[i+1:]

for f in files:
    with (gzip.open if f.endswith('.gz') else open)(f, 'rt') as s:
        txt = s.read()
    if all(p in txt for p in patterns):
        print(f)

The script given by terdon finished in 54 seconds. Actually, it took 39 seconds of wall time, because my processor is dual core. This is interesting, because my Python script took 49 seconds of wall time (and grep took 29 seconds).

The script by cas failed to terminate in a reasonable time, even on a smaller set of files that grep processed in under 4 seconds, so I had to kill it.

But his original awk proposal, even though it's slower than grep as is, has a potential advantage. In some cases, at least in my experience, you can expect that if the keywords are in a file at all, they all appear somewhere near the head of the file. This gives that solution a dramatic boost in performance:

for f in *; do
  zcat $f | awk -v F=$f \
    'NR>100 {exit} /one/ {a++} /two/ {b++} /three/ {c++} a&&b&&c {print F; exit}'
done

Finishes in a quarter of a second, as opposed to 25 seconds.

Of course, we may not have the advantage of searching for keywords that are known to occur near the beginning of the files. In such a case, the solution without NR>100 {exit} takes 63 seconds (50s of wall time).

Uncompressed files

There's no significant difference in running time between my grep solution and cas's awk proposal; both take a fraction of a second to execute.

Note that the variable initialization FNR == 1 { f1=f2=f3=0; } is mandatory in this case, to reset the counters for every subsequently processed file. As such, this solution requires editing the command in three places if you want to change a keyword or add new ones. On the other hand, with grep you can just append | xargs grep -l four or edit the keyword you want.
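
For instance, extending the xargs variant to a fourth keyword is just one more pipeline stage:

grep -l one * | xargs grep -l two | xargs grep -l three | xargs grep -l four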

A disadvantage of the grep solution that uses command substitution is that it will hang if, anywhere in the chain before the last step, there are no matching files. This doesn't affect the xargs variant, because the pipe will be aborted once grep returns a non-zero status. I've updated my script to use xargs, so I don't have to handle this myself, which makes the script simpler.

Your Python solution may benefit from pushing the loop down to the C layer with not all(p in text for p in patterns). – iruvar (21 hours ago)

@iruvar Thanks for the suggestion. I've tried it (sans not) and it finished in 32 seconds, so not that much of an improvement, but it's certainly more readable. – arekolek (21 hours ago)

You could use an associative array rather than f1,f2,f3 in awk, with key=search-pattern, val=count. – cas (8 hours ago)

@arekolek see my latest version using PerlIO::gzip rather than IO::Uncompress::AnyUncompress. It now takes only 3.1 seconds instead of 1m13s to process my 74MB of log files. – cas (4 hours ago)

BTW, if you have previously run eval $(lesspipe) (e.g. in your .profile, etc), you can use less instead of zcat -f and your for loop wrapper around awk will be able to process any kind of file that less can (gzip, bzip2, xz, and more). less can detect if stdout is a pipe and will just output a stream to stdout if it is. – cas (46 mins ago)


Another option: feed the words one at a time to xargs and have it run grep against the file. xargs can itself be made to stop as soon as an invocation of grep reports failure, by having the command return 255 to it (check the xargs documentation). Of course, the spawning of shells and forking involved in this solution will likely slow it down significantly.

printf '%s\n' one two three | xargs -n 1 sh -c 'grep -q $2 $1 || exit 255' _ file 

And to loop over all files:

for f in *; do
  if printf '%s\n' one two three |
       xargs -n 1 sh -c 'grep -q $2 $1 || exit 255' _ "$f"
  then
    printf '%s\n' "$f"
  fi
done
This looks nice, but I'm not sure how to use this. What is _ and file? Will this search in multiple files passed as argument and return files that contain all the keywords? – arekolek (15 hours ago)

@arekolek, added a loop version. And as for _, it's being passed as the $0 to the spawned shell - this would show up as the command name in the output of ps - I would defer to the master here. – iruvar (14 hours ago)
