Is there a Unix/Linux command that can turn this:

AMERICA	USA	NEW_YORK	AB-100
AMERICA	USA	NEW_YORK	VF-200
AMERICA	USA	NEW_YORK	XY-243
AMERICA	USA	LOS_ANGELES	UH-198
AMERICA	CANADA	TORONTO	UT-876
AMERICA	CANADA	TORONTO	UT-877
AMERICA	CANADA	VANCOUVER	UT-871
AMERICA	CANADA	VANCOUVER	UT-872
AMERICA	CANADA	VANCOUVER	UT-873
AMERICA	MEXICO	MEXICO	OU-098
AMERICA	MEXICO	MONTERREY	OU-099
AMERICA	MEXICO	MONTERREY	OU-100
EUROPE	FRANCE	PARIS	IV-122
EUROPE	FRANCE	PARIS	AV-112
EUROPE	FRANCE	PARIS	IF-111
EUROPE	FRANCE	PARIS	XX-190
EUROPE	FRANCE	TOULOUSE	TL-654

Into this:

AMERICA
	USA
		NEW_YORK
			AB-100
			VF-200
			XY-243
		LOS_ANGELES
			UH-198
	CANADA
		TORONTO
			UT-876
			UT-877
		VANCOUVER
			UT-871
			UT-872
			UT-873
	MEXICO
		MEXICO
			OU-098
		MONTERREY
			OU-099
			OU-100
EUROPE
	FRANCE
		PARIS
			IV-122
			AV-112
			IF-111
			XX-190
		TOULOUSE
			TL-654
  • Why use bash? Can we use different tools? Bash can do this, but it will be needlessly complex.
    – terdon
    Commented Jun 11, 2015 at 16:01
  • Given the spacing in your data, I suppose the input fields are TAB separated?
    – Janis
    Commented Jun 12, 2015 at 3:59
  • Yes, it's TAB separated.
    Commented Jun 12, 2015 at 6:56

3 Answers

For your example:

dir=$(mktemp -d)
sed 's|\t|/|g' file | while read -r line; do mkdir -p "$dir/$line"; done
(cd "$dir"; tree)
rm -r "$dir"

Output:

.
├── AMERICA
│   ├── CANADA
│   │   ├── TORONTO
│   │   │   ├── UT-876
│   │   │   └── UT-877
│   │   └── VANCOUVER
│   │       ├── UT-871
│   │       ├── UT-872
│   │       └── UT-873
│   ├── MEXICO
│   │   ├── MEXICO
│   │   │   └── OU-098
│   │   └── MONTERREY
│   │       ├── OU-099
│   │       └── OU-100
│   └── USA
│       ├── LOS_ANGELES
│       │   └── UH-198
│       └── NEW_YORK
│           ├── AB-100
│           ├── VF-200
│           └── XY-243
└── EUROPE
    └── FRANCE
        ├── PARIS
        │   ├── AV-112
        │   ├── IF-111
        │   ├── IV-122
        │   └── XX-190
        └── TOULOUSE
            └── TL-654
  • Now that's sneaky, well done! I suggest using dir=$(mktemp -d); sed 's| \+|/|g' file | while read -r line; do mkdir -p "$dir/$line"; done; tree "$dir" | sed 's/[│└─├]\+/ /g'; rm -r "$dir" to (i) avoid the needless cds and (ii) remove the indentation characters.
    – terdon
    Commented Jun 11, 2015 at 18:38
  • @Cyrus: Interesting idea! Though I'd not create a lot of directories where all we need is text processing. Note also that the single cd in your code is wrong; it works only if you happened to be in your HOME directory when starting the code sequence. Either use cd - or run cd "$dir"; tree in a subshell, (cd "$dir"; tree), and remove the final cd.
    – Janis
    Commented Jun 12, 2015 at 4:11
  • @Janis: I've updated my answer with your hint to use a subshell.
    – Cyrus
    Commented Jun 12, 2015 at 5:52
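
Following up on the comments above, here is a minimal sketch of the same temporary-directory trick that prints plain TAB-indented text and does not need tree at all. The function name tree_via_dirs is made up for this example. It assumes every field is a valid directory name (no "/" inside a field), and the output comes out in byte-sorted order rather than input order, so treat it as an illustration rather than a drop-in replacement.

```shell
# tree_via_dirs FILE
# Hypothetical helper: the name and interface are assumptions of this sketch.
tree_via_dirs() {
    dir=$(mktemp -d) || return 1
    tab=$(printf '\t')
    # One directory per input line: every TAB becomes a path separator.
    sed "s|$tab|/|g" "$1" | while IFS= read -r line; do
        mkdir -p "$dir/$line"
    done
    # List the directories, sort so parents come before their children,
    # drop the "." entry and the "./" prefix, then turn every remaining
    # leading path component into one TAB of indentation.
    (cd "$dir" && find . -type d | LC_ALL=C sort |
        sed -e '/^\.$/d' -e 's|^\./||' -e "s|[^/]*/|$tab|g")
    rm -r "$dir"
}
```

Called as tree_via_dirs file, it prints each name once, indented by one TAB per level.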
An awk script that works with standard awks, keeps the lines in their original order, and works with an arbitrary number of columns in the data:

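awk -F $'\t' '
# Print n TABs; i is a local variable.
function indent(n,    i) { for (i = 1; i <= n; i++) printf "\t" }
{
    changed = 0
    for (i = 1; i <= NF; i++)
        # Print a field if it, or any field to its left, differs from the
        # previous line, so a repeated name under a new parent is not lost.
        if (changed || $i != o[i]) {
            indent(i - 1)
            print $i
            o[i] = $i
            changed = 1
        }
}
'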


The same program logic can be implemented in shell. For bash:

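# Print $1 TABs.
indent () { ind=$( printf "%*s" "$1" '' ) ; printf '%s' "${ind// /$'\t'}" ;}
while IFS=$'\t' read -r -a f
do
    changed=
    # ${#f[@]} is the number of fields (${#f} would be the length of the first field).
    for ((i=0; i<${#f[@]}; i++))
    do
        if [[ -n $changed || "${f[i]}" != "${o[i]}" ]]
        then
            printf "%s%s\n" "$( indent "$i" )" "${f[i]}"
            o[i]=${f[i]}
            changed=1
        fi
    done
done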

Note: For ksh, replace read -a with read -A.

In GNU awk (the arrays of arrays used here need gawk 4.0 or later), you could do:

$ awk '{
      if (a[$1][$2][$3])
          a[$1][$2][$3] = a[$1][$2][$3] "\n\t\t\t" $4
      else
          a[$1][$2][$3] = "\t\t\t" $4
  }
  END {
      for (cont in a) {
          printf "%s\n", cont
          for (count in a[cont]) {
              printf "\t%s\n", count
              for (city in a[cont][count]) {
                  print "\t\t" city "\n" a[cont][count][city]
              }
          }
      }
  }' file
EUROPE
	FRANCE
		TOULOUSE
			TL-654
		PARIS
			IV-122
			AV-112
			IF-111
			XX-190
AMERICA
	USA
		NEW_YORK
			AB-100
			VF-200
			XY-243
		LOS_ANGELES
			UH-198
	CANADA
		VANCOUVER
			UT-871
			UT-872
			UT-873
		TORONTO
			UT-876
			UT-877
	MEXICO
		MEXICO
			OU-098
		MONTERREY
			OU-099
			OU-100

In Perl (hash keys come back in no particular order, so the groups may be printed in a different order on each run):

$ perl -lane 'push @{$k{$F[0]}{$F[1]}{$F[2]}}, "\t\t\t".$F[3];
    END {
        for $cont (keys(%k)) {
            print "$cont";
            for $coun (keys(%{$k{$cont}})) {
                print "\t$coun";
                for $city (keys(%{$k{$cont}{$coun}})) {
                    print "\t\t$city\n", join "\n", @{$k{$cont}{$coun}{$city}}
                }
            }
        }
    }' file
EUROPE
	FRANCE
		PARIS
			IV-122
			AV-112
			IF-111
			XX-190
		TOULOUSE
			TL-654
AMERICA
	USA
		NEW_YORK
			AB-100
			VF-200
			XY-243
		LOS_ANGELES
			UH-198
	MEXICO
		MONTERREY
			OU-099
			OU-100
		MEXICO
			OU-098
	CANADA
		VANCOUVER
			UT-871
			UT-872
			UT-873
		TORONTO
			UT-876
			UT-877
  • awk: line 2: syntax error at or near [ ... in Ubuntu
    Commented Jun 11, 2015 at 16:25
  • @user1598390 That's odd. Did you copy/paste directly into your shell or did you copy it manually?
    – terdon
    Commented Jun 11, 2015 at 16:32
  • @user1598390 See the updated answer. There were some bugs in the awk code (though they wouldn't have given that error), and I also added a Perl solution.
    – terdon
    Commented Jun 11, 2015 at 17:11
  • The Perl version runs but prints two items at most at the end of a branch. For example, NEW_YORK has three items, but it prints only two, and those two are the last one repeated.
    Commented Jun 11, 2015 at 18:36
  • @user1598390 Sorry again, another bug. The new version should work. I'm still curious about the awk, though. I just tested again and it works fine.
    – terdon
    Commented Jun 11, 2015 at 21:58
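
For awks without arrays of arrays, the same grouping idea can probably be expressed with ordinary one-dimensional arrays. The following is only a sketch under that assumption: the function name tab_tree and the array names are invented for this example. It groups lines that share a prefix even when they are not adjacent, keeps first-seen order, and handles any number of columns.

```shell
# tab_tree: hypothetical helper; reads TAB-separated lines on standard input.
tab_tree() {
    awk -F '\t' '
    {
        path = ""
        for (i = 1; i <= NF; i++) {
            parent = path
            path = path SUBSEP $i          # unique key for this node
            if (!(path in seen)) {
                seen[path] = 1
                n[parent]++                # number of children seen so far
                child[parent, n[parent]] = path
                name[path] = $i
            }
        }
    }
    # Print the children of node in first-seen order, one TAB per level.
    function show(node, depth,    k, j) {
        for (k = 1; k <= n[node]; k++) {
            for (j = 0; j < depth; j++) printf "\t"
            print name[child[node, k]]
            show(child[node, k], depth + 1)
        }
    }
    END { show("", 0) }
    '
}
```

Usage would be tab_tree < file. Because the whole input is held in memory until END, this suits small to medium files.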
