3

I have a large collection of XML files with the same exact structure:

$ cat file_<ID>.xml
...
...
...
<double>1.2342</double>
<double>2.3456</double>
...
...
...
...

where the number of such <double> entries in each XML file is fixed and known (in my particular case, 168).

I need to build a single csv file with the content of all these XML files stored as follows:

file_0001 1.2342 2.3456 ...
file_0002 1.2342 2.3456 ...

etc.

How can I do this efficiently?


The best I have come up with is this:

#!/usr/bin/env zsh

for x in $path_to_xmls/*.xml; do
    # 1) Get the doubles ignoring everything else
    # 2) Remove line breaks within the same file
    # 3) Add a new line at the end to construct the CSV file
    # 4) Join the columns together
    cat $x | grep -F '<double>' | \
        sed -r 's/.*>([0-9]+\.*[0-9]*).*?/\1/' | \
        tr '\n' ' ' | sed -e '$a\' >> table_numbers.csv
    echo ${x:t} >> file_IDs.csv
done

paste file_IDs.csv table_numbers.csv > final_table.csv

When I time the above script in a folder with ~10K XML files I get:

./from_xml_to_csv.sh 100.45s user 94.84s system 239% cpu 1:21.48 total 

not terrible, but I am hoping to work with 100x or 1000x more files. How can I make this processing more efficient?

Also, with my solution above, could I ever end up in a situation where the glob expansion reaches a limit, e.g. when working with millions of files? (the typical "too many args" problem).

Update

For anyone interested in a great solution to this problem, please read @mikeserv's answer. It is the fastest and the one that scales up the best by far.

2
  • I'm no shell expert, but for the too-many-args problem there's always xargs. Or you could export a function and then use find ... -exec func {} + for huge dirs. I do recall a limit of 32768 dirs on some BSD distro that screwed with some command combo. You could skip sed and use tr alone to remove the tags, since the element contents are just [0-9.] and the element is just <double></double>; tr can do all that. Do you need cat, or can you redirect straight into grep? Maybe you can cut down on pipes and use more redirection.
    – gwillie
    CommentedJul 23, 2015 at 0:56
  • It's hard to answer XML processing questions when your XML isn't valid. Any chance you could do a more complete version?
    – Sobrique
    CommentedAug 7, 2015 at 11:56

5 Answers

5

This should do the trick:

awk -F '[<>]' '
    NR!=1 && FNR==1 {printf "\n"}
    FNR==1          {sub(".*/", "", FILENAME); sub(".xml$", "", FILENAME); printf FILENAME}
    /double/        {printf " %s", $3}
    END             {printf "\n"}
' $path_to_xml/*.xml > final_table.csv

Explanation:

  • awk: use the program awk, I tested it with GNU awk 4.0.1
  • -F '[<>]': use < and > as field separators
  • NR!=1 && FNR==1{printf "\n"}: if it is not the first line overall (NR!=1) but the first line of a file (FNR==1) print a newline
  • FNR==1{sub(".*/", "", FILENAME); sub(".xml$", "", FILENAME); printf FILENAME}: if it is the first line of a file, strip away anything up to the last / (sub(".*/", "", FILENAME)) in the name of the file (FILENAME), strip a trailing .xml (sub(".xml$", "", FILENAME)) and print the result (printf FILENAME)
  • /double/{printf " %s", $3}: if a line contains "double" (/double/), print a space followed by the third field (printf " %s", $3). With < and > as separators this is the number (the first field is anything before the first < and the second field is double). If you want, you can format the numbers here: for example, with %8.3f instead of %s any number will be printed with 3 decimal places and an overall width (including dot and decimal places) of at least 8 (see the sketch after this list).
  • END{printf "\n"}: after the last line print an additional newline (this could be optional)
  • $path_to_xml/*.xml: the list of files
  • > final_table.csv: put the result into final_table.csv by redirecting the output
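For instance, the %8.3f formatting mentioned above would look like this (a minimal sketch; the glob and output file are simply the ones already used in this answer, and the width and precision are just the example values from the bullet):

awk -F '[<>]' '
    NR!=1 && FNR==1 {printf "\n"}
    FNR==1          {sub(".*/", "", FILENAME); sub(".xml$", "", FILENAME); printf "%s", FILENAME}
    /double/        {printf " %8.3f", $3}   # pad to width 8, 3 decimal places
    END             {printf "\n"}
' $path_to_xml/*.xml > final_table.csv

(printf "%s", FILENAME is used here only so that a % in a file name cannot be taken as a format specifier; otherwise this is the program from above.)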

In the case of "argument list too long" errors, you can use find with the -exec option to supply the file list instead of passing it directly:

find $path_to_xml -maxdepth 1 -type f -name '*.xml' -exec awk -F '[<>]' '
    NR!=1 && FNR==1 {printf "\n"}
    FNR==1          {sub(".*/", "", FILENAME); sub(".xml$", "", FILENAME); printf FILENAME}
    /double/        {printf " %s", $3}
    END             {printf "\n"}
' {} + > final_table.csv

Explanation:

  • find $path_to_xml: tell find to list files in $path_to_xml
  • -maxdepth 1: do not descend into subfolders of $path_to_xml
  • -type f: only list regular files (this also excludes $path_to_xml itself)
  • -name '*.xml': only list files that match the pattern *.xml; this needs to be quoted, else the shell will try to expand the pattern
  • -exec COMMAND {} +: run the command COMMAND with the matching files as parameters in place of {}. + indicates that multiple files may be passed at once, which reduces forking. If you use \; (; needs to be quoted else it is interpreted by the shell) instead of + the command is run for each file separately.
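To see the difference between + and \; at a glance, here is a small, hypothetical demonstration with echo standing in for awk:

# '+' packs as many file names as possible into each invocation:
find "$path_to_xml" -maxdepth 1 -type f -name '*.xml' -exec echo 'one batch:' {} +

# '\;' runs the command once per file:
find "$path_to_xml" -maxdepth 1 -type f -name '*.xml' -exec echo 'one file:' {} \;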

You can also use xargs in conjunction with find:

find $path_to_xml -maxdepth 1 -type f -name '*.xml' -print0 | xargs -0 awk -F '[<>]' '
    NR!=1 && FNR==1 {printf "\n"}
    FNR==1          {sub(".*/", "", FILENAME); sub(".xml$", "", FILENAME); printf FILENAME}
    /double/        {printf " %s", $3}
    END             {printf "\n"}
' > final_table.csv

Explanation:

  • -print0: output list of files separated by null characters
  • | (pipe): redirects standard output of find to the standard input of xargs
  • xargs: builds and runs commands from standard input, i.e. run a command for each argument (here file names) passed.
  • -0: direct xargs to assume arguments are separated by null characters
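The null separators matter because file names may contain spaces (or even newlines); a small demonstration with a made-up file name:

touch 'name with spaces.xml'                                        # hypothetical test file
find . -maxdepth 1 -name '*.xml' -print0 | xargs -0 -n1 echo arg:   # the name stays intact
find . -maxdepth 1 -name '*.xml'         | xargs    -n1 echo arg:   # the name is split into three arguments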

Alternatively:

awk -F '[<>]' '
    BEGINFILE {sub(".*/", "", FILENAME); sub(".xml$", "", FILENAME); printf FILENAME}
    /double/  {printf " %s", $3}
    ENDFILE   {printf "\n"}
' $path_to_xml/*.xml > final_table.csv

where BEGINFILE and ENDFILE are called when changing files (if your awk supports them).

2
  • 2
    Thanks -- I'm afraid I now get "argument list too long: awk".
    Commented Jul 23, 2015 at 14:19
  • This is great. Thanks!! -- The explanations are very helpful. I measured the latest solutions and they don't have the "argument list too long" errors anymore. They run in ~20 secs against 10K XML files, making them on par with @mikeserv's slowest solution, but slower than mikeserv's faster solution. This is definitely a GREAT and very clear solution to the problem!
    Commented Jul 24, 2015 at 23:17
4

Regarding glob expansion possibly exceeding a limit - yes and no. The shell's already running, and so it won't stop. But if you were to pass the entire globbed array as arguments to a single command, then yes, that is a definite possibility. The portable and robust way to handle this involves find...

find . \! -name . -prune -name pattern -type f -exec cat {} + | ... 

...which will only cat regular files in the current directory with a name which matches pattern, but will also only invoke cat as many times as is necessary to avoid exceeding ARG_MAX.

In fact, though, since you have a GNU sed we can almost do the whole thing with just sed in a find script.

cd /path/to/xmls
find . \! -name . -prune -name \*.xml -type f -exec \
    sed -sne'1F;$x;/\n*\( \)*<\/*double>/!d' \
        -e '$s//\1/gp;H' {} + | paste -d\\0 - -

I thought of another way. This will be very fast, but it absolutely depends on there being exactly 168 matches per file, and there can only be the one . dot in the filenames.

( export LC_ALL=C; set '' - -
  while [ "$#" -lt 168 ]; do set "$@$@"; done
  shift "$((${#}-168))"
  find . \! -name . -prune -name \*.xml -type f \
       -exec grep -F '<double>' /dev/null {} + |
  tr \<: '>>' |
  cut -d\> -f1,4 |
  paste -d\  "$@" |
  sed 'h;s|./[^>]*>||g;x;s|\.x.*||;s|..||;G;s|\n| |'
)

As requested, here's a little breakdown of how this command works:

  1. ( ... )

    • In the first place, the whole little script is run within its own subshell because there are a few global environmental properties that we'll be altering in the course of its execution, and this way when the job is done all of the properties we alter will be restored to their original values - whatever they were.
  2. export LC_ALL=C; set '' - -
    • By setting the current locale to C we can save our filters a lot of effort. In a UTF-8 locale any char might be represented by one or several bytes a piece, and any char found will need to be selected out of a group of many thousand possibles. In the C locale each char is a single byte, and there are only 128 of them. It makes char matching a much quicker affair overall.
    • The set statement changes the shell's positional parameters. Doing set '' - - sets $1 to \0, and $2 and $3 to -.
  3. while ... set "$@$@"; done; shift ...
    • Basically the whole point of this statement is to get an array of 168 dashes. We'll be using paste later to replace sequential sets of 167 newlines with spaces, while preserving the 168th. The most simple way to do this is to give it 168 argument references to - stdin and tell it to paste all of those together.
  4. find ... -exec grep -F '<double>' /dev/null ...
    • The find bit has been previously discussed, but with grep we print only those lines which can be matched against the -Fixed string <double>. By making grep's first argument /dev/null - which is a file that can never match our string - we ensure that grep is always searching 2 or more file arguments for every invocation. When invoked with 2 or more named search files grep will always print the filename like file_000.xml: at the head of every output line.
  5. tr \<: '>>'
    • Here we translate every occurrence in grep's output of either : or < characters to >.
    • At this point a sample matched line will look like ./file_000.xml> >double>0.0000>/double>.
  6. cut -d\> -f1,4
    • cut will strip from its output all of its input which cannot be found within either the 1st or 4th fields as divided by > chars.
    • At this point a sample matched line will look like ./file_000.xml>0.0000.
  7. paste -d\ "$@"
    • Already discussed, but here we paste input lines in batches of 168.
    • At this point 168 matched lines occur together like: ./file_000.xml>0.000 ... ./file_000.xml>0.167
  8. sed 'h;s|./[^>]*>||g;x;s|\.xml.*||;s|..||;G;s|\n| |'
    • Now the faster, smaller utilities have already done the majority of the work. On a multicore system, they've probably even done it concurrently. And those utilities - especially cut and paste - are far faster at what they do than any attempt at emulation we might make with higher-level utilities like sed, or, even worse, awk. But I've taken it as far as I can with them, and I have to call on sed.
    • First I hold a copy of every input line, then I globally remove every occurrence of the pattern ./[^>]*> in pattern space - so every occurrence of the filename. At this point sed's pattern space looks like: 0.000 0.001 ... 0.167
    • Then I exchange hold and pattern spaces and remove everything from \.xml.* on - so everything from the first filename onward in the saved copy of the line. I then strip the first two chars - the ./ - and at this point pattern space looks like file_000.
    • So all that remains is to stick them together. I get a copy of hold space appended to pattern space following a \newline char, then I s///ubstitute the \newline for a space.
    • And so, finally, pattern space looks like file_000 0.000...0.167. And that is what sed writes to output for each file find passes to grep.
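To see what the positional-parameter doubling and the paste call are doing (steps 2, 3 and 7), here is the same idea scaled down to 4 columns instead of 168; the letters are just placeholder input:

( set '' - -                                  # $1='' keeps "$@$@" from gluing two dashes together
  while [ "$#" -lt 4 ]; do set "$@$@"; done   # grow the list: 3 -> 5 -> ...
  shift "$((${#}-4))"                         # keep exactly 4 dashes
  printf '%s\n' a b c d e f g h | paste -d ' ' "$@"
)

which prints "a b c d" and "e f g h" - every 4th newline survives and the ones in between become spaces.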
15
  • Thanks -- but I am afraid I get: sed: -e expression #2, char 17: unknown option to 's' find: 'grep' terminated by signal 13 find: 'grep' terminated by signal 13. I am using GNU's variants of these tools in OS X.
    Commented Jul 23, 2015 at 15:15
  • 1
    @AmelioVazquez-Reina - oh - zsh? zsh does a weird thing with $# in math expansions. Sorry I should have done ${#} to protect against that in the first place.
    – mikeserv
    CommentedJul 23, 2015 at 20:38
  • 1
    Thanks @mikeserv! Both solutions now work great. This is a huge speed up from the original solution in the OP (and 100X faster than a Python solution that properly loads the XMLs one by one to build a CSV dataframe in Pandas). Current times on my machine: Sol (1): 19 seconds. Sol (2): 14 seconds, both transforming ~10K XML files into a single CSV file.
    Commented Jul 24, 2015 at 18:21
  • 1
    Thanks @mikeserv I prepended Sol (1) with export LC_ALL=C; and the timing for it was still ~20.28 seconds (i.e. ~unchanged).
    Commented Jul 24, 2015 at 21:23
  • 1
    @AmelioVazquez-Reina - even with all of the additional setup and the additional execs and etc, it's this kind of stuff that makes shell pipelines powerful. Unix is timeshared - and so if there are multiple processes running in a foreground process group they will get execution priority - and each winds up operating on its own processor core. And so there is no delay when the data is passed around - it's just many hands making light work. The key is to make the execs count - a single pipeline which handles all data in a job from start to finish, and they're almost always worth it.
    – mikeserv
    CommentedJul 24, 2015 at 21:39
2

Please, on behalf of future maintenance programmers and sysadmins - DON'T use a regex to parse XML. XML is a structured data type, and it is NOT well suited for regex parsing - you can 'fake it' by pretending it's plain text, but there's a bunch of semantically identical things in XML that don't parse the same. You can embed linefeeds, and have unary tags for example.
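For instance, a start or end tag may legally contain a line break, which a line-oriented match silently misses - a contrived fragment just for illustration:

# Both fragments describe the same <double> element, but only the first
# is visible to a per-line pattern match.
printf '<double>1.2342</double>\n'     | grep -c '<double>'   # prints 1
printf '<double\n>1.2342</double\n>\n' | grep -c '<double>'   # prints 0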

Thus - use a parser - I have mocked up some source data, because your XML isn't valid. Give me a more complete sample, and I'll give you a more complete answer.

At a basic level - we extract the double nodes like this:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse( \*DATA );

foreach my $double ( $twig->get_xpath('//double') ) {
    print $double->trimmed_text, "\n";
}

__DATA__
<root>
  <subnode>
    <another_node>
      <double>1.2342</double>
      <double>2.3456</double>
      <some_other_tag>fish</some_other_tag>
    </another_node>
  </subnode>
</root>

This prints:

1.2342
2.3456

So we expand this:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
use Text::CSV;

my $twig = XML::Twig->new;
my $csv  = Text::CSV->new( { eol => "\n" } );    # eol so each file ends up on its own CSV line

# open our results file
open( my $output, ">", "results.csv" ) or die $!;

# iterate over each XML file
foreach my $filename ( glob("/path/to/xml/*.xml") ) {

    # parse it
    $twig->parsefile($filename);

    # extract the text of all the 'double' elements
    my @doubles = map { $_->trimmed_text } $twig->get_xpath('//double');

    # print it as comma separated
    $csv->print( $output, [ $filename, @doubles ] );
}
close($output);

I think that should do the trick (without sample data, I can't say for sure). But note - by using an XML parser we don't get tripped up with some of the XML reformatting that can be done perfectly validly (as according to the XML spec). By using a CSV parser, we aren't going to get caught out by any fields with embedded commas or line feeds.

If you're looking for more specific nodes - you can specify a more detailed path. As it is, the above just looks for any instance of double. But you can use:

get_xpath("/root/subnode/another_node/double") 
0

You can try this one-liner for each file. awk's multiple delimiters do the splitting efficiently, and tr concatenates all the lines in memory rather than on disk.

for f in `ls *.xml` ; do echo $f,`grep double $f | awk -F '[<>]' '{print $3}' | tr '\n' ','`; done

I cannot profile this at my end - since I do not have the same data - but my hunch is that it should be faster.

Apart from that, this is the easiest problem to divide and conquer - if you have access to multiple machines or farms, you can just divide the whole task across several machines and finally concatenate all the outputs into one file. This way command-line limits and memory can also be managed.
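A minimal sketch of that batching idea, using split to carve the file list into chunks that could be handled by separate machines or processes (the batch size and the batch_/part_ names are made up for illustration):

# Build the list of files once (use find instead of ls for very large directories),
# then split it into batches of 10000 names each.
ls /path/to/xmls/*.xml > all_files.txt
split -l 10000 all_files.txt batch_

# Process each batch independently -- these loops could run on different machines.
for b in batch_*; do
    while read -r f; do
        echo "$f,$(grep double "$f" | awk -F '[<>]' '{print $3}' | tr '\n' ',')"
    done < "$b" > "part_${b}.csv"
done

# Finally, stitch the per-batch outputs together.
cat part_batch_*.csv > final_table.csv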

    12
    • can you explain the down-vote instead of the eat,shoot and leave ?
      – amisax
      CommentedJul 23, 2015 at 4:26
    • 1
      In most (all?) cases grep is not needed in conjunction with awk. Instead of grep double $f | awk -F '[<>]' '{print $3}' you can just use awk -F '[<>]' '/double/ {print $3}' $f. Also tr is not really needed here: echo $f$(awk -F '[<>]' '/double/{printf ",%d", $3} END{print ""}' $f)
      – Adaephon
      CommentedJul 23, 2015 at 9:39
    • 1
      @mikeserv That may depend on the implementation and the specific case. I did a quick check on some log files (cumulative size around 1.5 GiB, on ramdisk), searching for lines containing "line" (overall around 7.5 million matches) and for me grep was (far) slower. grep 'line' logfiles/* > /dev/null (GNU grep 2.16) takes about 15 seconds to finish, while gawk '/line/{print}' > /dev/null (GNU awk 4.0.1) takes about 8.5 seconds and mawk '/line/{print}' > /dev/null (Mike's AWK 1.3.3) only needs about 2.9 seconds.
      – Adaephon
      CommentedJul 24, 2015 at 8:08
    • 1
      @mikeserv Sorry, typing error, of course all three commands used logfiles/* as filelist. grep -F 'line' logfiles/* > /dev/null takes even longer with around 30 seconds. While grep -P 'line' logfiles/* > /dev/null (using PCRE) beats at least gawk with a time of 6 seconds. On a completely different machine with newer versions (grep 2.21, gawk 4.1.3, mawk 1.3.4) it looks different: gawk 8.0 s, mawk 4.2 s, grep 2.8 s, grep -F 3.4 s and grep -P 5.2 s. Here grep is indeed faster than awk. So, it really depends. (and grep -F does not help performance)
      – Adaephon
      CommentedJul 24, 2015 at 12:47
    • 1
      @Adaephon - wow. It's ... weird. In my experience awk is always the slowest tool in the box - only a step ahead of the turtles like perl or python, sed is faster, grep still more so, and dd, expand, cut, paste set the pace. This has been true with many different sets of tools that I have used - not just GNU. I'm thoroughly confused/intrigued by your results. Are the commands interpreting charsets in different ways?
      – mikeserv
      CommentedJul 24, 2015 at 14:27
-1

You're writing twice for every file. This is probably the most expensive part. Instead, try to keep the whole thing in memory, probably in an array, and write once at the end.

Look into ulimit if you start hitting memory limits. If you're increasing this workload to 10-100x, you're looking at maybe 10-100 GB of memory. You could batch this in a loop that does so many thousand files per iteration. I'm not sure whether this needs to be a repeatable process, but get more sophisticated if you need it to be faster/more robust. Otherwise, hand-stitch the batches afterward.

You're also spawning several processes per file - one for every pipe you have. You could do the whole parsing/munging (grep/sed/tr) with a single process. After the grep, zsh can handle the other translations via expansions (see man zshexpn). Or you could do all of the sed work in one invocation with multiple expressions. sed may be faster if you avoid -r (extended regex) and non-greediness. Your grep could pull matching lines from many files at once and write to intermediate temp files. Know your bottlenecks, though, and don't fix what isn't one.
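A rough sketch of those suggestions combined, staying close to the original script (untested; it assumes the same $path_to_xmls layout and final_table.csv name as the question, and uses zsh expansions for the joining that tr and the extra sed did before):

#!/usr/bin/env zsh
# Keep everything in an array and write once at the end;
# a single sed per file does the extraction, zsh expansions do the rest.
typeset -a rows
for x in $path_to_xmls/*.xml; do
    # pull out just the numbers, one per line
    nums=( ${(f)"$(sed -n 's:.*<double>\([0-9.]*\)</double>.*:\1:p' $x)"} )
    # ${x:t:r} = file name without directory or extension; join the numbers with spaces
    rows+=( "${x:t:r} ${(j: :)nums}" )
done
print -rl -- $rows > final_table.csv    # single write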
