
I'm new to scripting. I got stuck on a file-merge issue in Unix. I was looking for some direction and stumbled upon this site. I saw many great posts and replies but couldn't find a solution to my issue. I'd greatly appreciate any help.

I have three CSV files: Apex_10_Latest.csv, Apex_20_Latest.csv, Apex_30_Latest.csv. The number of columns varies across these 3 files. Typically the latest file, based on the numbering, may have some new columns appended at the end. So I want to take the latest header and stack the data from all 3 files into a new file, Apex.csv. When stacking the data from an older file, which may have fewer columns than the latest file, I want the missing values to be populated as null, with the appropriate delimiters.

Also, this has to be done repeatedly for multiple sets of files (3 each), all in the same folder:

  • Apex_10_Latest.csv, Apex_20_Latest.csv, Apex_30_Latest.csv merged into Apex.csv
  • Code_10_Latest.csv, Code_20_Latest.csv, Code_30_Latest.csv merged into Code.csv
  • Trans_10_Latest.csv, Trans_20_Latest.csv, Trans_30_Latest.csv merged into Trans.csv

Following is the format of the source files and the expected target file.

SOURCE FILES:

  • Apex_30_Latest.csv:
    A,B,C,D
    1,2,3,4
    2,3,4,5
    3,4,5,6

  • Apex_20_Latest.csv:
    A,B,C
    4,5,6
    5,6,7
    6,7,8

  • Apex_10_Latest.csv:
    A,B
    7,8
    8,9
    9,10

EXPECTED TARGET FILE:

  • Apex.csv
    A,B,C,D
    1,2,3,4
    2,3,4,5
    3,4,5,6
    4,5,6,,
    5,6,7,,
    6,7,8,,
    7,8,,,
    8,9,,,
    9,10,,,

Thanks...

  • Welcome to stackexchange! Great first question. I believe the commands cut and paste can be configured to do what you want, but you may need to use a more fully featured text processing tool such as awk.
    – Wildcard
Commented Oct 23, 2015 at 2:34
  • A question: Do the headers always start the same? Or might you have A,B,D in one file? Or C,D,E? In other words, is column 3 in one file the same as column 3 in every other file that has 3 columns? (And for every other number in place of 3?)
    – Wildcard
Commented Oct 23, 2015 at 5:37

4 Answers



With Miller (http://johnkerl.org/miller/doc/) this is, as usual, very easy:

mlr --csv unsparsify Apex_*_Latest.csv 

gives you

A,B,C,D
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,
5,6,7,
6,7,8,
7,8,,
8,9,,
9,10,,
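The question also asks for the merge to be repeated for every prefix in the folder (Apex, Code, Trans). As a sketch, not part of the original answer, a small wrapper loop can derive each prefix from the filenames. The sample files created below are only for illustration, and the mlr call assumes Miller is installed:

```shell
#!/bin/sh
# Sketch: derive each prefix from the *_10_Latest.csv files in the current
# directory, then merge that prefix's whole group. Sample files are created
# inline so the loop has something to work on.
printf 'A,B\n7,8\n' > Apex_10_Latest.csv
printf 'A,B\n1,2\n' > Code_10_Latest.csv

prefixes=$(ls *_10_Latest.csv 2>/dev/null | sed 's/_10_Latest\.csv$//')
for p in $prefixes
do
    echo "merging ${p}_*_Latest.csv into ${p}.csv"
    # mlr --csv unsparsify "${p}"_*_Latest.csv > "${p}.csv"   # uncomment with Miller installed
done
```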
     cat $(ls -1 Apex_*_Latest.csv | sort -nr -k2 -t'_') |
     awk -F',' '
         NR == 1 { nfm = NF }                  # the newest file comes first: its header sets the width
         NR > 1 && $0 ~ /^[A-Za-z]/ { next }   # drop header lines repeated from the older files (the data rows here are numeric)
         {
             for (i = 1; i <= nfm; i++) printf "%s,", $i
             print ""
         }' > Apex.csv

    You can reverse-sort the filenames on the second _-separated field (30, 20, 10, ...) and cat the files so that the lines with the highest number of columns come first.

    Then, in awk, you take the highest number of columns NF from the first line and keep it in nfm: if (NR==1) { nfm=NF }. Header lines repeated from the older files are skipped; in this data they are easy to spot because only headers start with a letter.

    Finally, a for loop runs i (the column number) from 1 up to nfm and prints the value of field i followed by ','. If there is no value for the ith field (which happens when a file has fewer columns than the latest one), it prints just the ','.

    • You can remove everything on the first line up to awk and insert $(ls -r Apex_*_Latest.csv) on the last line between the last single quote ' and the >. That would eliminate cat and sort, be less code, and be easier to understand.
      – RobertL
      Commented Oct 26, 2015 at 3:24

    I think the previous answer is the best; I'm just showing a different approach, since I haven't used awk in years, ever since perl and python became big. I think awk is fine; it's just that a mixture of shell, sed, python and/or perl has suited my work better.

    However, in this case, I think anyone can see that the awk solution is more succinct and easier to read. Come to think of it, I think I've heard awk referred to as the command line spreadsheet, or something like that. :-)

    Based on the original post, I've chosen to let the ls command sort the filenames by file modification time, as opposed to relying on the filename format. Six of one, half a dozen of the other.

    So just for comparison, here's an efficient, portable, modular(?!), pure shell version of the solution:

        #!/bin/sh

        get_commas() {
            sed 's/[^,]//g; 1q' "$@"
        }

        get_extra_commas() {
            local maxcommas="$1"
            local file="$2"
            local new_commas=$(get_commas "$file")
            local extra_commas=""
            while [ "${new_commas}${extra_commas}" != "${maxcommas}" ]
            do
                extra_commas=",$extra_commas"
            done
            echo "$extra_commas"
        }

        unset header
        ls -t Apex*.csv |
        while read filename
        do
            if [ -z "$header" ]
            then
                header="$(sed 1q "$filename")"
                commas=$(echo "$header" | get_commas)
                echo "$header"
            fi
            extra_commas=$(get_extra_commas $commas "$filename")
            sed "1d; s/\$/$extra_commas/" "$filename"
        done
    • The analogies that have stuck with me from a variety of sources are: sed is the assembly language of text processing; awk is the command line spreadsheet tool; perl is the omnipotent god of text processing. :)
      – Wildcard
      Commented Oct 27, 2015 at 6:19
    • More like omnipotent demon! :-)
      – RobertL
      Commented Oct 27, 2015 at 6:45

    Here is an answer implemented in Miller:

        $ cat rect.mlr
        for (k,v in $*) {
            @fields[k] = v; # retain already-seen field names
        }
        for (k,v in @fields) {
            if (isabsent($[k])) {
                $[k] = "";
            }
        }
        $ mlr --csvlite put -f rect.mlr Apex_30_Latest.csv Apex_20_Latest.csv Apex_10_Latest.csv
        A,B,C,D
        1,2,3,4
        2,3,4,5
        3,4,5,6
        4,5,6,
        5,6,7,
        6,7,8,
        7,8,,
        8,9,,
        9,10,,

    Since Miller handles named columns intrinsically, header-line management becomes simpler.
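    For comparison, the same named-column idea can be sketched in plain awk, without Miller. This is an illustrative sketch, not part of the original answer: it assumes the newest file (listed first) contains every column name, maps each older file's header onto those positions, and pads the rest. The sample files are created inline for the demo.

    ```shell
    #!/bin/sh
    # Plain-awk sketch of name-based merging. Handles subset or reordered
    # headers, as long as the first file contains all the column names.
    printf 'A,B,C,D\n1,2,3,4\n' > Apex_30_Latest.csv
    printf 'A,B,C\n4,5,6\n'     > Apex_20_Latest.csv

    awk -F',' '
        NR == 1 {                   # master header from the newest file
            print
            n = NF
            for (i = 1; i <= NF; i++) { idx[$i] = i; map[i] = i }
            nf_file = NF
            next
        }
        FNR == 1 {                  # older header: map its names to master positions
            for (i = 1; i <= NF; i++) map[i] = idx[$i]
            nf_file = NF
            next
        }
        {
            for (j = 1; j <= n; j++) out[j] = ""
            for (i = 1; i <= nf_file; i++) out[map[i]] = $i
            line = out[1]
            for (j = 2; j <= n; j++) line = line "," out[j]
            print line
        }
    ' Apex_30_Latest.csv Apex_20_Latest.csv > Apex.csv
    cat Apex.csv
    ```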
