It looks like you can do a merge sort and then prune the result. Basically, sort -m just assumes you know what you're doing and runs a single pass over two or more inputs, interleaving them as their lexicographic sort order converges.
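To make the single-pass interleaving concrete, here is a minimal sketch. The scratch files `one` and `two` and their contents are made up for illustration; they are not your XML files. Each input is already sorted, which is all that sort -m expects.

```shell
# Hypothetical scratch files; each one is already sorted on its own.
tmp=$(mktemp -d)
printf '%s\n' a c e > "$tmp/one"
printf '%s\n' b c d > "$tmp/two"

# -m merges without re-sorting: it always emits whichever input's
# next line sorts first. LC_ALL=C pins plain byte-wise ordering.
merged=$(LC_ALL=C sort -m "$tmp/one" "$tmp/two")
printf '%s\n' "$merged"

rm -r "$tmp"
```

Note that the duplicate c line survives the merge: sort -m only interleaves, it does not deduplicate, which is why a pruning step is still needed afterwards.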
Here's what a GNU sort -m (merge) prints for your example:
<app>
<app>
<bbb>
<bbb>
<jjj>test1</jjj>
</bbb>
<bbb>
<jjj>test2</jjj>
</bbb>
<bbb>
<jjj>test2</jjj>
</bbb>
</app>
<jjj>test3</jjj>
</bbb>
<bbb>
<jjj>test4</jjj>
</bbb>
</app>
So at least it's all folded in now but, like I said, you still have to prune it. This sed script will do it for your examples:
sort -m /tmp/xml[12] | sed -ne:n -e'$!s|/a..> *$|bbb>|;$p' \
    -e'\|^[^>]*b.*\n|{N;P;D;}' \
    -eN -e's|\(.*\)\n\(.*\n\)* *\1 *$|\1|' \
    -e's|\n|&|3;tD' -ebn -e:D -eP\;D
It just ensures it's got at least three lines stacked as it works through the input, and it compares the first line in the stack against the last whenever the first line isn't a <bbb> tag.
<app>
<bbb>
<jjj>test1</jjj>
</bbb>
<bbb>
<jjj>test2</jjj>
</bbb>
<bbb>
<jjj>test3</jjj>
</bbb>
<bbb>
<jjj>test4</jjj>
</bbb>
</app>
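If the stack-and-compare idea is hard to follow in the full script, a much smaller sketch of the same mechanism may help. This is not the script above: it is the well-known two-line variant that only drops a line when it repeats the one directly before it, and the sample input is made up. It uses the same building blocks, though: N to stack another line, a \(.*\)\n\1 backreference to compare the first stacked line with the last, and P plus D to slide the window forward.

```shell
# Made-up input with one adjacent duplicate.
input=$(printf '%s\n' a b b c)

# $!N  : unless on the last line, append the next line to the stack
# /^\(.*\)\n\1$/!P : print the first line unless both lines are identical
# D    : drop the first line and restart with what remains
deduped=$(printf '%s\n' "$input" | sed '$!N; /^\(.*\)\n\1$/!P; D')
printf '%s\n' "$deduped"
```

The full script needs a deeper stack because, in the merged XML, the repeated line is not always directly adjacent to its twin.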