
I have some files:

file1.csv file2.csv file3.csv 

A given script processes them, logging to this file:

my.log 

with this format (filename col2 col3):

file1.csv 1 a
file2.csv 1 a
file3.csv 1 a
file2.csv 2 b
file1.csv 2 b
file3.csv 2 b
file1.csv 3 c
file2.csv 3 c
file3.csv 3 c
file2.csv 4 d
file3.csv 4 d

I'd like to get one col3 value (only the last one) from my.log for each *.csv file.

I run this command:

ls *.csv | xargs -I@ bash -c "cat my.log | grep @ | tail -n 1 | awk '{ print $3 }'" 

It works well, except that awk is giving me all the columns.

file1.csv 3 c
file2.csv 4 d
file3.csv 4 d

How could I get only col3 column? For example, this:

c
d
d


    In your expression

     "cat my.log | grep @ | tail -n 1 | awk '{ print $3 }'" 

    ...the double quotes around the string mean that the single quotes inside it are treated as literal characters. They don't protect $3 from the outer shell, which expands it as a positional parameter before bash -c ever runs. Since $3 is unset (unless this is in a script that you've invoked with at least 3 arguments), it expands to the empty string, the awk program becomes { print }, and awk prints the whole line.
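    A quick way to see this for yourself (a sketch; run it in a shell with no positional parameters set, which is what `set --` guarantees):

```shell
# With no positional parameters, "$3" inside double quotes expands to the
# empty string, so the command string handed to bash -c has already lost it:
set --
echo "cat my.log | grep @ | tail -n 1 | awk '{ print $3 }'"
# cat my.log | grep @ | tail -n 1 | awk '{ print  }'

# Escaping the dollar sign keeps $3 intact for the inner awk:
echo "cat my.log | grep @ | tail -n 1 | awk '{ print \$3 }'"
# cat my.log | grep @ | tail -n 1 | awk '{ print $3 }'
```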

    You could fix this by escaping the $:

    ls *.csv | xargs -I@ bash -c "cat my.log | grep @|tail -n 1|awk '{print \$3}'" 

    ...or by moving the awk out of the xargs expression:

    ls *.csv | xargs -I@ bash -c "cat my.log | grep @|tail -n 1"|awk '{print $3}' 
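    To check the first fix end to end, the question's sample data can be recreated in a scratch directory (a sketch; the `mktemp -d` scratch directory and the reconstructed file contents are assumptions taken from the question):

```shell
# Recreate the example .csv files and log from the question in a temp dir.
cd "$(mktemp -d)"
touch file1.csv file2.csv file3.csv
cat > my.log <<'EOF'
file1.csv 1 a
file2.csv 1 a
file3.csv 1 a
file2.csv 2 b
file1.csv 2 b
file3.csv 2 b
file1.csv 3 c
file2.csv 3 c
file3.csv 3 c
file2.csv 4 d
file3.csv 4 d
EOF

# With \$3 escaped, the inner awk really receives $3 and prints only
# the third column of the last matching line for each file.
ls *.csv | xargs -I@ bash -c "cat my.log | grep @ | tail -n 1 | awk '{ print \$3 }'"
# c
# d
# d
```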
    • Definitely, this is the right fix, @JigglyNaga. Thank you so much! – Aug 28, 2019 at 9:29
    • This addresses the bug in your specific implementation. But you could instead extract the fields you need in a single awk script (without ls or xargs), using an awk array. – Aug 28, 2019 at 9:37
    • Yes, moving awk out of the xargs would print the right results... the drawback is that output would be delayed until all files are processed. – Aug 28, 2019 at 9:41
    • Have you tried that and observed a delay? It's a pipeline, so the awk { print $3 } at the end still processes each line of input as it becomes available. And the array method has to read the whole of my.log before printing its results, but it only needs to read it once. – Aug 28, 2019 at 9:45
    • You are absolutely right, @JigglyNaga. I was guessing, and guessed wrong. – Aug 28, 2019 at 10:02

    Piping the output of ls into xargs is a bad idea (in fact, doing anything with the output of ls other than simply viewing it in your terminal is a bad idea). If you absolutely must do something like this, at least use something like find . -maxdepth 1 -type f -iname '*.csv' -print0 and pipe that into xargs -0r.
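    A small demonstration of why the NUL-delimited form is safer (a sketch; the filename with a space in it is invented for the example):

```shell
cd "$(mktemp -d)"
touch 'bad name.csv'

# Plain ls | xargs splits on whitespace: one filename becomes two "arguments".
ls *.csv | xargs -n1 echo got:
# got: bad
# got: name.csv

# find -print0 | xargs -0 keeps the filename intact.
find . -maxdepth 1 -type f -iname '*.csv' -print0 | xargs -0 -n1 echo got:
# got: ./bad name.csv
```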

    But, in this case, you don't need to do it at all, because the filenames of your .csv files are already in my.log.

    In awk:

    #!/usr/bin/awk -f

    { seen[$1] = $3 }

    END {
      for (f in seen) {
        print seen[f]
      }
    }

    or as a one-liner:

    $ awk '{seen[$1] = $3}; END {for (f in seen) { print seen[f] };}' my.log
    c
    d
    d

    These will print the last value seen in column 3 for each file listed in column 1.

    If you want it to print only the first value seen in column 3, change it to:

    !seen[$1] { seen[$1] = $3 } 
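    On the sample my.log from the question, the two variants give different results. A sketch comparing them (the filename column is printed alongside for clarity, and since awk's for (f in seen) iteration order is not guaranteed, the output is piped through sort):

```shell
cd "$(mktemp -d)"
cat > my.log <<'EOF'
file1.csv 1 a
file2.csv 1 a
file3.csv 1 a
file2.csv 2 b
file1.csv 2 b
file3.csv 2 b
file1.csv 3 c
file2.csv 3 c
file3.csv 3 c
file2.csv 4 d
file3.csv 4 d
EOF

# Last value seen per file: each line overwrites the previous entry.
awk '{ seen[$1] = $3 } END { for (f in seen) print f, seen[f] }' my.log | sort
# file1.csv c
# file2.csv d
# file3.csv d

# First value seen per file: only assign while the entry is still empty.
awk '!seen[$1] { seen[$1] = $3 } END { for (f in seen) print f, seen[f] }' my.log | sort
# file1.csv a
# file2.csv a
# file3.csv a
```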

    If you don't want to use find | xargs and you really need to use the filenames of all the .csv files currently in the current directory, one alternative is to do something like this:

    #!/usr/bin/perl

    use strict;

    my $logfile = shift;       # get the first arg (the logfile name)
    my $re = join("|", @ARGV); # turn the remaining args into a regular expression
    @ARGV = $logfile;          # set the logfile name as the sole cmd-line argument

    my %seen = ();

    while (<>) {
      next unless (m/^($re)/o); # ignore any filenames that weren't on the cmd line
      my(@F) = split;
      $seen{$F[0]} = $F[2];     # perl arrays start from 0, not 1
    };

    foreach my $file (sort keys %seen) {
      print $seen{$file}, "\n";
    };

    Save it as, e.g., nandro.pl, make it executable with chmod +x, and run it as:

    $ ./nandro.pl my.log *.csv
    c
    d
    d
    • Awesome, @cas! The awk script looks very elegant :-) – Aug 28, 2019 at 11:42
    • Elegant, and it runs many times faster than running bash -c "cat | grep | tail | awk" for each .csv file. Both the awk and perl versions make only one pass through my.log. Your xargs-based "loop" makes one pass per .csv file. – cas, Aug 28, 2019 at 11:50
