As you have titled your question bash programming, I'll submit a semi-bash example.
Pure bash:
You could read the IP filter file and then check the data file line by line, matching each line against those entries. But at this volume it would be really slow.
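A minimal sketch of that approach (my own illustration; it assumes one IP per line in file_filter and the IP as the first whitespace-separated field of file_data):

#!/bin/bash
# Naive pure-bash cross check: every data line is compared against every filter entry.
filter=()
while IFS= read -r line; do
    [[ -n "$line" ]] && filter+=("$line")
done < "file_filter"

while read -r ip rest; do
    for f in "${filter[@]}"; do
        if [[ "$ip" == "$f" ]]; then
            printf '%s\n' "$ip $rest"
            break
        fi
    done
done < "file_data"

With m filter entries and n data lines this does up to m×n string comparisons, which is exactly why it collapses on big files.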
You could rather easily implement bubble sort, selection sort, insertion sort, merge sort etc., but, again, for this kind of volume it would be a goner and most likely worse than a line-by-line compare. (It depends a lot on the volume of the filter file.)
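Just to illustrate how straightforward (and how slow) that would be, here is a rough insertion-sort sketch over a bash array; the function name and the nameref style (bash 4.3+) are my own choices, not anything required by the task:

# Insertion sort of a bash array, lexical order -- simple to write, but O(n^2),
# so only viable for small arrays.
function insertion_sort() {
    local -n a=$1        # nameref to the array to sort (bash 4.3+)
    local key
    local -i i j
    for ((i = 1; i < ${#a[@]}; ++i)); do
        key=${a[i]}
        j=$((i - 1))
        while ((j >= 0)) && [[ "${a[j]}" > "$key" ]]; do
            a[j+1]=${a[j]}
            ((j--))
        done
        a[j+1]=$key
    done
}

data=(10.0.0.9 10.0.0.2 10.0.0.5)
insertion_sort data
printf '%s\n' "${data[@]}"   # 10.0.0.2  10.0.0.5  10.0.0.9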
sort + bash:
Another option would be to sort the file with sort and process the input in-house by e.g. a binary search. This, as well, would be much slower than the other suggestions posted here, but let's give it a try.
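For reference, the sort invocation used below keys on the first field only; something like (the output file name is just an example):

# -b: ignore leading blanks, -k1,1: use only the first field (the IP) as the sort key.
sort -bk1,1 "file_data" > "file_data.sorted"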
Firstly, it is a question of bash version. As of version 4 we have mapfile, which reads a file into an array. This is a lot faster than the traditional read -ra … loop. Combined with sort, it could be scripted by something like (for this task):
mapfile arr <<< "$(sort -bk1,1 "$file_in")"
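For comparison, the traditional read loop that mapfile replaces would look roughly like this (a sketch only; noticeably slower on big files):

# Line-by-line read into an array -- works on older bash, but slow.
arr=()
while IFS= read -r line; do
    arr+=("$line")
done < <(sort -bk1,1 "$file_in")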
Then it is a question of having a search algorithm to find matches in this array. A simple way could be to use a binary search. It is efficient, and on e.g. an array of 1,000,000 elements it would give a fairly quick lookup (on the order of 20 comparisons per search).
# Index of the first match, set by in_array_bs() on success.
declare -i match_index

# Binary search for "needle" in the globally sorted array "arr",
# comparing against the first field of each element.
function in_array_bs() {
    local needle="$1"
    local -i max=$arr_len
    local -i min=0
    local -i mid
    while ((min < max)); do
        (( (mid = ((min + max) >> 1)) < max )) || break
        # "${arr[mid]// *}" strips everything from the first space,
        # i.e. keeps only the first field (the IP).
        if [[ "${arr[mid]// *}" < "$needle" ]]; then
            ((min = mid + 1))
        else
            max=$mid
        fi
    done
    if [[ "$min" == "$max" && "${arr[min]// *}" == "$needle" ]]; then
        match_index=$min
        return 0
    fi
    return 1
}
Then you would say:
for x in "${filter[@]}"; do
    if in_array_bs "$x"; then
        …   # Check match_index+0, +1, +2 etc. to cover duplicates.
    fi
done
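The filter array itself is not shown above; one way to fill it, assuming the filter file simply holds one IP per line, would be:

# Read the filter file into the "filter" array, one entry per line (bash 4+).
mapfile -t filter < "file_filter"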
A sample script (not debugged), merely meant as a starter. For a lower volume, where one would want to depend only on sort, it could serve as a template. But again: s.l.o.w.e.r b.y a l.o.t.:
#!/bin/bash

file_in="file_data"
file_srch="file_filter"

declare -a arr      # The entire data file as array.
declare -i arr_len  # The length of "arr".
declare -i index    # Matching index, if any.

# Time print helper function for debug.
function prnt_ts() { date +"%H:%M:%S.%N"; }

# Binary search.
function in_array_bs() {
    local needle="$1"
    local -i max=$arr_len
    local -i min=0
    local -i mid
    while ((min < max)); do
        (( (mid = ((min + max) >> 1)) < max )) || break
        if [[ "${arr[mid]// *}" < "$needle" ]]; then
            ((min = mid + 1))
        else
            max=$mid
        fi
    done
    if [[ "$min" == "$max" && "${arr[min]// *}" == "$needle" ]]; then
        index=$min
        return 0
    fi
    return 1
}

# Search.
# "index" is set to matching index in "arr" by `in_array_bs()`.
re='^[^ ]+ +([^ ]+)'
function search() {
    if in_array_bs "$1"; then
        while [[ "${arr[index]// *}" == "$1" ]]; do
            [[ "${arr[index]}" =~ $re ]]
            printf "%s\n" "${BASH_REMATCH[1]}"
            ((++index))
        done
    fi
}

sep="--------------------------------------------"

# Timestamp start
ts1=$(date +%s.%N)

# Print debug information
printf "%s\n%s MAP: %s\n%s\n" \
    "$sep" "$(prnt_ts)" "$file_in" "$sep" >&2

# Read sorted file to array.
mapfile arr <<< "$(sort -bk1,1 "$file_in")"

# Print debug information.
printf "%s\n%s MAP DONE\n%s\n" \
    "$sep" "$(prnt_ts)" "$sep" >&2

# Define length of array.
arr_len=${#arr[@]}

# Print time start search
printf "%s\n%s SEARCH BY INPUT: %s\n%s\n" \
    "$sep" "$(prnt_ts)" "$file_srch" "$sep" >&2

# Read filter file.
re_neg_srch='^[ '$'\t'$'\n'']*$'
debug=0
while IFS=$'\n'$'\t'-" " read -r ip time trash; do
    if ! [[ "$ip" =~ $re_neg_srch ]]; then
        ((debug)) && printf "%s\n%s SEARCH: %s\n%s\n" \
            "$sep" "$(prnt_ts)" "$ip" "$sep" >&2
        # Do the search
        search "$ip"
    fi
done < "$file_srch"

# Print time end search
printf "%s\n%s SEARCH DONE\n%s\n" \
    "$sep" "$(prnt_ts)" "$sep" >&2

# Print total time
ts2=$(date +%s.%N)
echo $ts1 $ts2 | awk '{printf "TIME: %f\n", $2 - $1}' >&2
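If the script were saved as, say, bs_filter.sh (the name is just an example), matches and debug/timing output could be kept apart like this:

# Matches go to stdout, debug/timing information goes to stderr.
chmod +x bs_filter.sh
./bs_filter.sh > matches.txt 2> debug.log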