This should do the trick:
```
awk -F '[<>]' '
    NR!=1 && FNR==1 { printf "\n" }
    FNR==1   { sub(".*/", "", FILENAME); sub("\\.xml$", "", FILENAME); printf "%s", FILENAME }
    /double/ { printf " %s", $3 }
    END      { printf "\n" }
' "$path_to_xml"/*.xml > final_table.csv
```
Explanation:
- `awk`: use the program awk; I tested it with GNU awk 4.0.1.
- `-F '[<>]'`: use `<` and `>` as field separators.
- `NR!=1 && FNR==1 { printf "\n" }`: if it is not the first line overall (`NR!=1`) but the first line of a file (`FNR==1`), print a newline.
- `FNR==1 { ... }`: if it is the first line of a file, strip away anything up to the last `/` in the name of the file (`sub(".*/", "", FILENAME)`), strip a trailing `.xml` (`sub("\\.xml$", "", FILENAME)`) and print the result (`printf "%s", FILENAME`).
- `/double/ { printf " %s", $3 }`: if a line contains "double" (`/double/`), print a space followed by the third field (`printf " %s", $3`). With `<` and `>` as separators, the third field is the number (the first field is anything before the first `<`, the second field is `double`). If you want, you can format the numbers here: using `%8.3f` instead of `%s`, every number is printed with 3 decimal places and an overall width (including dot and decimal places) of at least 8 characters (see the sketch after this list).
- `END { printf "\n" }`: after the last line, print an additional newline (this may be optional).
- `"$path_to_xml"/*.xml`: the list of input files; quoting the variable keeps paths with spaces intact.
- `> final_table.csv`: put the result into `final_table.csv` by redirecting the output.
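To make the mechanics concrete, here is a minimal sketch; the file names `a.xml`/`b.xml` and the values are made up, and it assumes each `<double>` element sits on its own line:

```
$ cat a.xml
<list>
<double>1.5</double>
<double>2.25</double>
</list>
$ cat b.xml
<list>
<double>3.75</double>
</list>
$ awk -F '[<>]' '...' a.xml b.xml     # '...' is the program shown above
a 1.5 2.25
b 3.75
```

With `printf " %8.3f", $3` the same input would come out as `a    1.500    2.250` and so on, each number right-aligned in 8 columns.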
In the case of "argument list too long" errors, you can use `find` with `-exec` to run awk on the matching files instead of expanding the glob on the command line:
```
find "$path_to_xml" -maxdepth 1 -type f -name '*.xml' -exec awk -F '[<>]' '
    NR!=1 && FNR==1 { printf "\n" }
    FNR==1   { sub(".*/", "", FILENAME); sub("\\.xml$", "", FILENAME); printf "%s", FILENAME }
    /double/ { printf " %s", $3 }
    END      { printf "\n" }
' {} + > final_table.csv
```
Explanation:
- `find "$path_to_xml"`: tell `find` to list files in `$path_to_xml`.
- `-maxdepth 1`: do not descend into subfolders of `$path_to_xml`.
- `-type f`: only list regular files (this also excludes `$path_to_xml` itself).
- `-name '*.xml'`: only list files that match the pattern `*.xml`; this needs to be quoted, else the shell would try to expand the pattern.
- `-exec COMMAND {} +`: run the command `COMMAND` with the matching files as arguments in place of `{}`. The `+` indicates that multiple files may be passed at once, which reduces forking. If you use `\;` (the `;` needs to be quoted, else it is interpreted by the shell) instead of `+`, the command is run once per file (see the `echo` experiment below).
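To see the difference between `+` and `\;`, here is a quick experiment with `echo` standing in for awk (two made-up files `a.xml` and `b.xml`):

```
$ find . -maxdepth 1 -name '*.xml' -exec echo {} +    # one invocation for all files
./a.xml ./b.xml
$ find . -maxdepth 1 -name '*.xml' -exec echo {} \;   # one invocation per file
./a.xml
./b.xml
```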
You can also use `xargs` in conjunction with `find`:
```
find "$path_to_xml" -maxdepth 1 -type f -name '*.xml' -print0 | xargs -0 awk -F '[<>]' '
    NR!=1 && FNR==1 { printf "\n" }
    FNR==1   { sub(".*/", "", FILENAME); sub("\\.xml$", "", FILENAME); printf "%s", FILENAME }
    /double/ { printf " %s", $3 }
    END      { printf "\n" }
' > final_table.csv
```
Explanation:
- `-print0`: output the list of files separated by null characters.
- `|` (pipe): redirects the standard output of `find` to the standard input of `xargs`.
- `xargs`: builds and runs command lines from standard input, i.e. runs the given command with the file names read from standard input appended as arguments.
- `-0`: tells `xargs` to assume the arguments are separated by null characters (the sketch below shows why this matters).
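The null separation matters as soon as a file name contains whitespace; a small sketch with a made-up file name:

```
touch "$path_to_xml/two words.xml"                           # hypothetical name containing a space
find "$path_to_xml" -name '*.xml' | xargs ls                 # ls receives "two" and "words.xml" and fails
find "$path_to_xml" -name '*.xml' -print0 | xargs -0 ls      # the name arrives as one argument
```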
If your awk supports them, the per-file bookkeeping can also be done with the special `BEGINFILE` and `ENDFILE` patterns:

```
awk -F '[<>]' '
    BEGINFILE { sub(".*/", "", FILENAME); sub("\\.xml$", "", FILENAME); printf "%s", FILENAME }
    /double/  { printf " %s", $3 }
    ENDFILE   { printf "\n" }
' "$path_to_xml"/*.xml > final_table.csv
```

where `BEGINFILE` and `ENDFILE` are run when changing files. They are a GNU awk extension (available since gawk 4.0); the `FNR==1` version above works with any awk.
If there are too many args, there's always `xargs`. Or you could export a function and then use `find ... -exec func {} + ...` for huge dirs. I do recall a limit of 32768 dirs on some BSD distro that screwed with some command combo.

You could skip `sed` and use `tr` alone to remove the tags if the element contents are just `[0-9.]`; as the element is just `<double></double>`, `tr` can do all that (a sketch follows). Do you need `cat`, or can you redirect straight into `grep`? Maybe you can cut down on pipes and use more redirection.
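As a minimal sketch of that `tr` idea, assuming one `<double>` element per line and contents that are only digits and dots (`some_file.xml` is a made-up name):

```
# deleting the tag's letters plus <, > and / leaves only the numeric value,
# since none of those characters can occur in a number
grep double "$path_to_xml"/some_file.xml | tr -d 'double<>/'
```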