
I have two text files. File 2 contains over 1,000,000 log lines. File 1 contains IP addresses, one per line. I want to read the lines of file 2 and search for them in file 1, I mean:

file 1:

34.123.21.32
45.231.43.21
21.34.67.98

file 2 :

34.123.21.32 0.326 - [30/Oct/2013:06:00:06 +0200]
45.231.43.21 6.334 - [30/Oct/2013:06:00:06 +0200]
45.231.43.21 3.673 - [30/Oct/2013:06:00:06 +0200]
34.123.21.32 4.754 - [30/Oct/2013:06:00:06 +0200]
21.34.67.98 1.765 - [30/Oct/2013:06:00:06 +0200]
...

I want to search for each IP from file 1, line by line, in file 2 and print the time values (for example: 0.326) to a new file.

How can I do this?

  • What output are you expecting? Especially with duplicate IP addresses?
    – Bernhard
    Commented Nov 4, 2013 at 15:25
  • I want to see the time parameters for the duplicate IP addresses
    – DessCnk
    Commented Nov 5, 2013 at 9:01
  • And I want to grep these time parameters: a time parameter (e.g. 4.008) has to be bigger than 10 seconds (bigger than 10.000). I have to add an egrep "[0-9][0-9].[0-9][0-9]" command to my script. (A sketch follows after these comments.)
    – DessCnk
    Commented Nov 5, 2013 at 9:06
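Aside: to filter for time parameters bigger than 10 seconds, as asked in the comment above, a numeric comparison in awk is more robust than an egrep pattern like [0-9][0-9].[0-9][0-9] (where the unescaped dot matches any character). A minimal sketch, assuming the time is the second whitespace-separated field and reusing the grep -F approach from the answers below:

$ grep -Ff file1 file2 | awk '$2 > 10 {print $2}' > new_file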

3 Answers


Join + sort

If you're trying to find the IPs that are present in both files, you can use the join command, but you'll need to use sort to pre-sort the files prior to joining them.

$ join -o 2.2 <(sort file1) <(sort file2) 

Example

$ join -o 2.2 <(sort file1) <(sort file2)
1.765
0.326
4.754
3.673
6.334

Another example

file 1a:

$ cat file1a
34.123.21.32
45.231.43.21
21.34.67.98
1.2.3.4
5.6.7.8
9.10.11.12

file 2a:

$ cat file2a
34.123.21.32 0.326 - [30/Oct/2013:06:00:06 +0200]
45.231.43.21 6.334 - [30/Oct/2013:06:00:06 +0200]
45.231.43.21 3.673 - [30/Oct/2013:06:00:06 +0200]
34.123.21.32 4.754 - [30/Oct/2013:06:00:06 +0200]
21.34.67.98 1.765 - [30/Oct/2013:06:00:06 +0200]
1.2.3.4 1.234 - [30/Oct/2013:06:00:06 +0200]
4.3.2.1 4.321 - [30/Oct/2013:06:00:06 +0200]

Running the join command:

$ join -o 2.2 <(sort file1a) <(sort file2a)
1.234
1.765
0.326
4.754
3.673
6.334

NOTE: The original order of file2 is lost with this method because we sorted it first. In return, this method only needs to scan file2 a single time.
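If the original order of file2 matters, one (untested) variation is to number file2's lines with nl before sorting, join on the IP field, and sort the result back by line number. A sketch, assuming the log format shown above:

$ join -1 1 -2 2 -o 2.1,2.3 <(sort file1) <(nl -ba file2 | sort -b -k2,2) | sort -n | awk '{print $2}'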

grep

You can use grep to search for matches in file2 using the lines that are in file1, but this method isn't as efficient as the first one I showed you: it scans file2 looking for each line of file1.

$ grep -f file1 file2 | awk '{print $2}' 

Example

$ grep -f file1a file2a | awk '{print $2}'
0.326
6.334
3.673
4.754
1.765
1.234

Improving grep's performance

You can speed up grep's performance by using this form:

$ LC_ALL=C grep -f file1 file2 | awk '{print $2}' 

You can also tell grep that the strings in file1 are fixed strings, not regular expressions (-F), which will also help in getting better performance.

$ LC_ALL=C grep -Ff file1 file2 | awk '{print $2}' 
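One caveat worth adding (not from the original answer): even with -F, an address like 1.2.3.4 can match as a substring of a longer one such as 91.2.3.45. The -w flag (supported by GNU and BSD grep, though not required by POSIX) restricts matches to whole words and avoids that:

$ LC_ALL=C grep -wFf file1 file2 | awk '{print $2}'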

Generally in software you try to avoid this kind of approach, since it's basically a loop-within-a-loop solution. But there are times when it's the best that can be achieved with a computer + software.
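For completeness, a single-pass alternative not covered in this answer: awk can load file1 into an associative array and then stream file2 once, preserving file2's order. A sketch, assuming the formats shown above:

$ awk 'NR==FNR {ips[$1]; next} $1 in ips {print $2}' file1 file2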


  • Thanks for the solution, but the results are not right. I mean: line 1 of file1 is 34.123.21.32, and I want to search for this IP in all the lines of file2. And I want to search for all the IPs from file 1, one by one, in file2 and write the results to a new file.
    – DessCnk
    Commented Nov 5, 2013 at 12:35
  • You want to search for each IP in file1 for all occurrences of it in file2, yes? This solution does that, but it sorts the files so they're in order and then pulls the matching results out. Do you need to keep them in the original order? The approach I'm using is more efficient b/c it doesn't have to keep rescanning the file for matches; it only has to scan one time!
    – slm
    Commented Nov 5, 2013 at 13:04
  • @user50591 - see updates, I've added a 2nd method that uses grep and explained its drawbacks. Also added additional info that hopefully explains why the join + sort is the fastest method.
    – slm
    Commented Nov 5, 2013 at 13:26
  • Thanks a lot slm. Now it is OK and working more efficiently. I edited all my scripts and it's OK!
    – DessCnk
    Commented Nov 5, 2013 at 13:33
  • I would have thought that grep -f is smart enough to go through the file only once. I stand corrected.
    – Joseph R.
    Commented Nov 5, 2013 at 14:17

You can tell grep to obtain its patterns from a file using the -f switch (which is in the POSIX standard):

sort file1 | uniq |            # Avoid duplicate entries in file1
    grep -f /dev/stdin file2 | # Search in file2 for patterns piped on stdin
    awk '{print $2}' \
    > new_file                 # Print the second field (time) and redirect to a new file

Note that if one IP address appears multiple times in file2, all its time entries will be printed.

This did the job in less than 2 seconds on a 5 million-line file on my system.

  • Thanks for your suggestion, but this command uses 99.7% CPU and the processing time is too long.
    – DessCnk
    Commented Nov 5, 2013 at 9:02
  • @user50591 How large is file1? Also, have you tried slm's answer?
    – Joseph R.
    Commented Nov 5, 2013 at 11:31
  • Thanks a lot, file1's size is 488K and I will try slm's answer now.
    – DessCnk
    Commented Nov 5, 2013 at 12:13

As you have titled your question bash programming, I'll submit a semi-bash example.

Pure bash:

You could read the IP filter file and then check the log line by line, matching each line against those IPs. But at this volume it would be really slow.
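For illustration only, that naive pure-bash version might look like the sketch below (file names as in the question); expect it to crawl on a million lines:

while IFS= read -r ip; do
    # Rescan the whole log for every filter IP: O(n*m) line reads.
    while IFS=' ' read -r addr time _; do
        [[ "$addr" == "$ip" ]] && printf '%s\n' "$time"
    done < file2
done < file1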

You could fairly easily implement bubble, selection, insertion, or merge sort, etc., but for this kind of volume it would be a goner and most likely worse than a line-by-line comparison. (It depends a lot on the volume of the filter file.)

sort + bash:

Another option would be to sort the file with sort and process the input in-house by e.g. a binary search. This, as well, would be much slower than the other suggestions posted here, but let's give it a try.


Firstly, it is a question of bash version. Since version 4.0 we have mapfile, which reads a file into an array. This is a lot faster than the traditional read -ra loop. Combined with sort it could be scripted by something like this (for this task):

mapfile arr <<< "$(sort -bk1,1 "$file_in")" 
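For comparison, a pre-4.0 equivalent using read (the slower route alluded to above) would be something like this sketch:

arr=()
while IFS= read -r line; do
    arr+=("$line")    # One append per line: much slower than mapfile.
done < <(sort -bk1,1 "$file_in")    # Note: read also strips the trailing newline, unlike plain mapfile.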

Then it is a question of having a search algorithm to find matches in this array. A simple way could be to use a binary search. It is efficient, and on an array of e.g. 1,000,000 elements it gives a fairly quick lookup.

declare -i match_index

function in_array_bs() {
    local needle="$1"
    local -i max=$arr_len
    local -i min=0
    local -i mid
    while ((min < max)); do
        (( (mid = ((min + max) >> 1)) < max )) || break
        if [[ "${arr[mid]// *}" < "$needle" ]]; then
            ((min = mid + 1))
        else
            max=$mid
        fi
    done
    if [[ "$min" == "$max" && "${arr[min]// *}" == "$needle" ]]; then
        match_index=$min
        return 0
    fi
    return 1
}

Then you would say:

for x in "${filter[@]}"; do
    if in_array_bs "$x"; then
        …  # check match_index+0, +1, +2 etc. to cover duplicates.
    fi
done

A sample script follows (not debugged, merely a starter). For lower volumes, where one would want to depend only on sort, it could be a template. But again, slower by a lot:

#!/bin/bash

file_in="file_data"
file_srch="file_filter"

declare -a arr      # The entire data file as array.
declare -i arr_len  # The length of "arr".
declare -i index    # Matching index, if any.

# Time print helper function for debug.
function prnt_ts() { date +"%H:%M:%S.%N"; }

# Binary search.
function in_array_bs() {
    local needle="$1"
    local -i max=$arr_len
    local -i min=0
    local -i mid
    while ((min < max)); do
        (( (mid = ((min + max) >> 1)) < max )) || break
        if [[ "${arr[mid]// *}" < "$needle" ]]; then
            ((min = mid + 1))
        else
            max=$mid
        fi
    done
    if [[ "$min" == "$max" && "${arr[min]// *}" == "$needle" ]]; then
        index=$min
        return 0
    fi
    return 1
}

# Search.
# "index" is set to matching index in "arr" by `in_array_bs()`.
re='^[^ ]+ +([^ ]+)'
function search() {
    if in_array_bs "$1"; then
        while [[ "${arr[index]// *}" == "$1" ]]; do
            [[ "${arr[index]}" =~ $re ]]
            printf "%s\n" "${BASH_REMATCH[1]}"
            ((++index))
        done
    fi
}

sep="--------------------------------------------"

# Timestamp start
ts1=$(date +%s.%N)

# Print debug information
printf "%s\n%s MAP: %s\n%s\n" \
    "$sep" "$(prnt_ts)" "$file_in" "$sep" >&2

# Read sorted file to array.
mapfile arr <<< "$(sort -bk1,1 "$file_in")"

# Print debug information.
printf "%s\n%s MAP DONE\n%s\n" \
    "$sep" "$(prnt_ts)" "$sep" >&2

# Define length of array.
arr_len=${#arr[@]}

# Print time start search
printf "%s\n%s SEARCH BY INPUT: %s\n%s\n" \
    "$sep" "$(prnt_ts)" "$file_srch" "$sep" >&2

# Read filter file.
re_neg_srch='^[ '$'\t'$'\n'']*$'
debug=0
while IFS=$'\n'$'\t'-" " read -r ip time trash; do
    if ! [[ "$ip" =~ $re_neg_srch ]]; then
        ((debug)) && printf "%s\n%s SEARCH: %s\n%s\n" \
            "$sep" "$(prnt_ts)" "$ip" "$sep" >&2
        # Do the search
        search "$ip"
    fi
done < "$file_srch"

# Print time end search
printf "%s\n%s SEARCH DONE\n%s\n" \
    "$sep" "$(prnt_ts)" "$sep" >&2

# Print total time
ts2=$(date +%s.%N)
echo $ts1 $ts2 | awk '{printf "TIME: %f\n", $2 - $1}' >&2
