Sorting an XML file in UNIX with a Bash script?

Question

I am trying to sort my XML file that looks like this by alphabetical order. This is a part of a larger bash script so it needs to work within that script:

<Module>
    <Settings>
        <Dimensions>
            <Volume>13000</Volume>
            <Width>5000</Width>
            <Length>2000</Length>
        </Dimensions>
        <Stats>
            <Mean>1.0</Mean>
            <Max>3000</Max>
            <Median>250</Median>
        </Stats>
    </Settings>
    <Debug>
        <Errors>
            <Strike>0</Strike>
            <Wag>1</Wag>
            <MagicMan>0</MagicMan>
        </Errors>
    </Debug>
</Module>

I want the end result to look like this; only the innermost tags should be sorted:

<Module>
    <Settings>
        <Dimensions>
            <Length>2000</Length>
            <Volume>13000</Volume>
            <Width>5000</Width>
        </Dimensions>
        <Stats>
            <Max>3000</Max>
            <Mean>1.0</Mean>
            <Median>250</Median>
        </Stats>
    </Settings>
    <Debug>
        <Errors>
            <MagicMan>0</MagicMan>
            <Strike>0</Strike>
            <Wag>1</Wag>
        </Errors>
    </Debug>
</Module>

I am trying to use sort like this, where -t'>' sets > as the field delimiter and -k4 sorts on the 4th field, which I expected to be the inner tag, but it is not working.

sort -t'>' -k4 file > final.xml

I get funky output that sorts the other columns in with the sorted inner tags.

Any help would be appreciated
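
To illustrate where the funky output comes from, here is a minimal sketch on a cut-down sample (the `demo_xml` helper and its contents are illustrative, and the indentation is dropped for brevity). With one tag per line, no line has a 4th `>`-separated field, so every sort key is empty and sort falls back to comparing whole lines across the entire file:

```shell
#!/usr/bin/env bash
# Illustrative only: a cut-down sample with the nesting indentation removed.
demo_xml() {
    printf '%s\n' \
        '<Module>' \
        '<Dimensions>' \
        '<Volume>13000</Volume>' \
        '<Length>2000</Length>' \
        '</Dimensions>' \
        '</Module>'
}

# Every key is empty (there is no 4th field), so sort uses its last-resort
# whole-line comparison. LC_ALL=C makes that comparison byte-wise.
demo_xml | LC_ALL=C sort -t'>' -k4
```

In the C locale the closing tags float to the top, because `/` sorts before any letter, so the structure of the document is lost rather than just the inner tags being reordered.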

You have to use an XML parser to parse XML data. sort is only line-based, so it just can't handle XML. — glenn jackman, Commented Jul 20, 2021 at 22:03
Taking a step backwards for a moment, why do you need the XML file to be sorted? The usual XML parsing tools don't generally need to care. — Chris Davies, Commented Jul 20, 2021 at 22:20
Here is an example of what you're looking at: stackoverflow.com/q/9161934/7552 — glenn jackman, Commented Jul 20, 2021 at 22:21
@colinodowd when you say "working on an embedded platform", does that mean you only have the mandatory POSIX toolset (e.g. you have grep, sed, and awk but not perl or any other non-mandatory tools) or something else? Are they the GNU versions of those tools or something else (e.g. what does awk --version output)? — Ed Morton, Commented Jul 20, 2021 at 22:51
xsltproc is a very widely used, very widely ported, relatively modest resource consumer (if bash runs, it's unlikely xsltproc can't run). One copy of xsltproc plus one question to the XSLT folks would probably get you a tiny XSLT program that will do the job efficiently and correctly (i.e., working no matter what the XML formatting is). It also provides some insurance for any future XML manipulation that may crop up for you. — Ron Burk, Commented Jul 21, 2021 at 0:39
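
As a rough idea of what the xsltproc route suggested in the last comment could look like, here is a sketch, assuming xsltproc is installed. The `sort_innermost` helper name and the stylesheet are illustrative and do not come from any of the commenters. The stylesheet copies everything unchanged, except that elements whose children are all leaf elements get those children emitted in name order:

```shell
#!/usr/bin/env bash
# Sketch only: sort_innermost is a hypothetical helper name.
# Requires xsltproc. Usage: sort_innermost file.xml > final.xml
sort_innermost() {
    local xsl status
    xsl=$(mktemp) || return 1
    cat > "$xsl" <<'EOF'
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Identity rule: copy everything as-is by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Elements whose child elements are all leaves: sort children by name. -->
  <xsl:template match="*[* and not(*/*)]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="*">
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
EOF
    xsltproc "$xsl" "$1"
    status=$?
    rm -f "$xsl"
    return "$status"
}
```

Because this works on the parsed tree, it does not depend on one tag per line. Note that the second template only re-emits child elements, so comments or stray text directly inside a sorted element would be dropped, and the output is re-indented by xsltproc rather than keeping the original whitespace.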

steeldriver · Accepted Answer · 2021-07-21 14:50:59Z

[with a generous assist from Kusalananda]

You can do it using the xq wrapper from yq (a jq wrapper for YAML/XML) to leverage jq's sorting capabilities:

$ xq -x 'getpath([paths(scalars)[0:-1]] | unique | .[]) |= (to_entries|sort_by(.key)|from_entries)' file.xml
<Module>
  <Settings>
    <Dimensions>
      <Length>2000</Length>
      <Volume>13000</Volume>
      <Width>5000</Width>
    </Dimensions>
    <Stats>
      <Max>3000</Max>
      <Mean>1.0</Mean>
      <Median>250</Median>
    </Stats>
  </Settings>
  <Debug>
    <Errors>
      <MagicMan>0</MagicMan>
      <Strike>0</Strike>
      <Wag>1</Wag>
    </Errors>
  </Debug>
</Module>

Explanation:

paths(scalars) generates a list of all paths, from root to leaf; the array slice [0:-1] then removes the leaf node, resulting in a list of paths to the deepest non-leaf nodes:

["Module","Settings","Dimensions"]
["Module","Settings","Dimensions"]
["Module","Settings","Dimensions"]
["Module","Settings","Stats"]
["Module","Settings","Stats"]
["Module","Settings","Stats"]
["Module","Debug","Errors"]
["Module","Debug","Errors"]
["Module","Debug","Errors"]

[paths(scalars)[0:-1]] | unique | .[] puts the list into an array so that it may be de-duplicated by unique. The iterator .[] turns it back to a list:
```
["Module","Debug","Errors"]
["Module","Settings","Dimensions"]
["Module","Settings","Stats"]
```
getpath() turns the de-duplicated list of paths into the bottom-level objects at those paths, whose contents are then sorted and written back with the |= update-assignment operator.

The -x option tells xq to convert the result back to XML rather than leaving it as JSON.

Note that while sort works here in place of sort_by(.key), the former implicitly sorts by values as well as keys if the keys are non-unique.
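
The steps in the explanation can be tried with plain jq on a JSON rendering of a cut-down sample. This is only a sketch: the JSON literal is a hand-written stand-in for what xq would hand to jq, not actual xq output, and the helper names `leaf_parents` and `sort_leaf_keys` are made up for illustration.

```shell
#!/usr/bin/env bash
# Hand-written JSON stand-in for part of the sample (an assumption, not xq output).
json='{"Module":{"Settings":{"Stats":{"Mean":"1.0","Max":"3000","Median":"250"}},"Debug":{"Errors":{"Strike":"0","Wag":"1","MagicMan":"0"}}}}'

# Step 1: de-duplicated paths to the parents of all leaf values.
leaf_parents() {
    jq -c '[paths(scalars)[0:-1]] | unique | .[]'
}

# Step 2: rewrite each of those parents with its keys in sorted order.
sort_leaf_keys() {
    jq -c 'getpath([paths(scalars)[0:-1]] | unique | .[])
           |= (to_entries | sort_by(.key) | from_entries)'
}

# Only run the demonstration when jq is actually available.
if command -v jq >/dev/null 2>&1; then
    printf '%s\n' "$json" | leaf_parents
    printf '%s\n' "$json" | sort_leaf_keys
fi
```

The first call prints the two parent paths (Debug/Errors, then Settings/Stats, since unique also sorts). The second prints the whole document with only the keys inside Stats and Errors reordered, while Settings and Debug keep their original positions.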

Maybe someone with stronger jq-fu can figure out how (map? with_entries?) to remove some of the duplication... — steeldriver, Commented Jul 20, 2021 at 23:29
.Module[][] |= (to_entries|sort_by(.key)|from_entries) This makes it more explicit that you're sorting by the keys, and sorts all 2nd-layer objects down from .Module. You can't use with_entries() here without rethinking as you will need to have a construct like to_entries|map(something)|from_entries to use that. — Kusalananda, Commented Jul 21, 2021 at 6:16
@Kusalananda thanks, I always forget that [] can be used to iterate nested objects, not just arrays. I was trying to do something from "the other end" using paths(scalars)[0:-1] but couldn't make it work. — steeldriver, Commented Jul 21, 2021 at 11:54
We were just lucky that all the keys on the same level in the structure needed sorting. If it had been a more uneven structure to the document, you would have needed to do something like what your code does. — Kusalananda, Commented Jul 21, 2021 at 11:55

Ed Morton · Accepted Answer · 2021-07-21 14:15:33Z

Using any awk, sort, and cut in any shell on every Unix box, and assuming your input is always formatted like the sample in your question (the lines to be sorted always have start and end tags, the other lines don't, and < doesn't appear anywhere else in the input):

$ cat tst.sh
#!/usr/bin/env bash

awk '
    BEGIN { FS="<"; OFS="\t" }
    {
        idx = ( (NF == 3) && (pNF == 3) ? idx : NR )
        print idx, $0
        pNF = NF
    }
' "${@:--}" |
sort -k1,1n -k2,2 |
cut -f2-

$ ./tst.sh file
<Module>
    <Settings>
        <Dimensions>
            <Length>2000</Length>
            <Volume>13000</Volume>
            <Width>5000</Width>
        </Dimensions>
        <Stats>
            <Max>3000</Max>
            <Mean>1.0</Mean>
            <Median>250</Median>
        </Stats>
    </Settings>
    <Debug>
        <Errors>
            <MagicMan>0</MagicMan>
            <Strike>0</Strike>
            <Wag>1</Wag>
        </Errors>
    </Debug>
</Module>

The above uses awk to decorate the input to sort so that we can just run sort once on the whole file and then use cut to remove the number that awk added. Here are the intermediate steps so you can see what's happening:

awk '
    BEGIN { FS="<"; OFS="\t" }
    {
        idx = ( (NF == 3) && (pNF == 3) ? idx : NR )
        print idx, $0
        pNF = NF
    }
' file
1	<Module>
2	    <Settings>
3	        <Dimensions>
4	            <Volume>13000</Volume>
4	            <Width>5000</Width>
4	            <Length>2000</Length>
7	        </Dimensions>
8	        <Stats>
9	            <Mean>1.0</Mean>
9	            <Max>3000</Max>
9	            <Median>250</Median>
12	        </Stats>
13	    </Settings>
14	    <Debug>
15	        <Errors>
16	            <Strike>0</Strike>
16	            <Wag>1</Wag>
16	            <MagicMan>0</MagicMan>
19	        </Errors>
20	    </Debug>
21	</Module>

awk '
    BEGIN { FS="<"; OFS="\t" }
    {
        idx = ( (NF == 3) && (pNF == 3) ? idx : NR )
        print idx, $0
        pNF = NF
    }
' file | sort -k1,1n -k2,2
1	<Module>
2	    <Settings>
3	        <Dimensions>
4	            <Length>2000</Length>
4	            <Volume>13000</Volume>
4	            <Width>5000</Width>
7	        </Dimensions>
8	        <Stats>
9	            <Max>3000</Max>
9	            <Mean>1.0</Mean>
9	            <Median>250</Median>
12	        </Stats>
13	    </Settings>
14	    <Debug>
15	        <Errors>
16	            <MagicMan>0</MagicMan>
16	            <Strike>0</Strike>
16	            <Wag>1</Wag>
19	        </Errors>
20	    </Debug>
21	</Module>
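
Since this approach depends on the formatting assumption stated at the top, it may be worth checking the input before sorting it. A minimal sketch follows; `check_layout` is a hypothetical helper, not part of the script above, and it only looks at how many `<` characters each line contains:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: every non-blank line must hold either a
# single tag (2 "<"-separated fields) or a start/end tag pair (3 fields).
check_layout() {
    awk -F'<' '
        /^[[:space:]]*$/ { next }          # ignore blank lines
        NF != 2 && NF != 3 {
            printf "line %d: unexpected layout: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }               # non-zero exit if any line failed
    ' "${@:--}"
}
```

It could be used as `check_layout file && ./tst.sh file > final.xml`, so the sort only runs when every line has the expected shape. It would not catch everything (a `<` inside a comment or CDATA section on an otherwise normal-looking line, for instance), so it is a sanity check rather than validation.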

Alternatively, using GNU awk for sorted_in:

$ cat tst.awk
BEGIN { FS="<" }
NF == 3 {
    rows[$0]
    f = 1
    next
}
f && (NF < 3) {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (row in rows) {
        print row
    }
    delete rows
    f = 0
}
{ print }

If you don't have GNU awk you can use any awk and any sort for that same approach:

$ cat tst.awk
BEGIN { FS="<" }
NF == 3 {
    rows[$0]
    f = 1
    next
}
f && (NF < 3) {
    cmd = "sort"
    for (row in rows) {
        print row | cmd
    }
    close(cmd)
    delete rows
    f = 0
}
{ print }

but it'll be much slower than the first 2 solutions above, as it spawns a subshell to call sort for every block of nested lines.

Ron Burk · Accepted Answer · 2021-07-21 01:46:39Z

Answered as asked: pure(ish) bash solution (still calls sort however). Produces specified output from example input. Fragile, of course, as any solution that treats XML as line-oriented must be.

#!/bin/bash
function FunkySort(){
    local inputfile="$1"
    local -a linestosort=()
    local line ltchars
    while IFS= read -r line; do
        # strip all but less-than characters
        ltchars="${line//[^<]}"
        # if we guess it is "innermost" tag
        if [ ${#ltchars} -gt 1 ]; then
            # append to array
            linestosort+=("${line}")
        else
            # if non-innermost but have accumulated some of them
            # ([@] counts array elements; without it bash returns the
            # length of element 0)
            if [ ${#linestosort[@]} -gt 0 ]; then
                # then emit accumulated lines in sorted order
                printf "%s\n" "${linestosort[@]}" | sort
                # and reset array
                linestosort=()
            fi
            printf "%s\n" "$line"
        fi
    done < "$inputfile"
}

FunkySort "test.xml" >"test.out"
