Change the format of an XML string that has a date timestamp in the middle, in many text files in the same folder (*.txt)

Question

I have a lot of text files with dates in XML formatted as follows:

<DATA2020-04-13T08:59:05.427 />

I need to change it into this:

<DATA>2020-04-13T08:59:05.427</DATA>

Notes:

The date and time vary between strings and they are not to be altered. Before and after this string in each line there is a lot more XML-formatted content. Also, using the Unix date command is not an option; I really need to change the XML string inside the files.

I was thinking of using sed / awk / perl find and replace, maybe using wildcards. Can anyone please figure out a way to achieve this?

it looks like broken xml, which is probably why they want to fix it. — cas, Commented Jan 5, 2022 at 8:28
@cas Well, it would be better to fix whatever created that XML instead of adding a post-processing step that may mess up other things. — Kusalananda, Commented Jan 5, 2022 at 8:45
yes, it would. sometimes, though, it's faster and easier to just fix the broken output. especially if it took hours or days to generate or was generated from data that's no longer available. sometimes you just have to deal with the broken crap you have rather than the perfect data you wish you had. — cas, Commented Jan 5, 2022 at 8:54
If it's a one-off, then sure, repair the data using sed or awk. If this is a regular data feed that's arriving on a daily basis, then fix the source to generate real XML rather than constructing a system that immortalises the bug (and breaks on the day that the bug is fixed). — Michael Kay, Commented Jan 5, 2022 at 10:56

cas · Accepted Answer · 2022-01-05 09:25:59Z

$ echo '<DATA2020-04-13T08:59:05.427 />' | sed -E 's/<DATA(20[^/]*) \/>/<DATA>\1<\/DATA>/'
<DATA>2020-04-13T08:59:05.427</DATA>

Or, using = as the delimiter instead of /, to avoid having to backslash-escape the /s:

$ echo '<DATA2020-04-13T08:59:05.427 />' | sed -E 's=<DATA(202[^/]*) />=<DATA>\1</DATA>='
<DATA>2020-04-13T08:59:05.427</DATA>

This makes it a little easier to read (of course, you'd now have to escape any = characters in the search pattern and replacement text).
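The question asks for the change inside many *.txt files, while the commands above read from a pipe. The following is a minimal sketch of one way to apply the same substitution in place, assuming a sed that supports -i (GNU or BSD; the attached-suffix form -i.bak is accepted by both and keeps a backup of each original). The temporary directory and the sample file are hypothetical and exist only to make the demo self-contained; the g flag is an addition, in case a line holds more than one such element.

```shell
#!/bin/sh
# Sketch: apply the substitution to every *.txt file in a directory, in place.
# Assumes sed supports -i (GNU or BSD); -i.bak leaves a backup next to each file.
# The directory and sample file are hypothetical, created only for this demo.
dir=$(mktemp -d)
printf '%s\n' '<ROW><ID>1</ID><DATA2020-04-13T08:59:05.427 /><X>y</X></ROW>' > "$dir/sample.txt"

for f in "$dir"/*.txt; do
    # g (an addition to the command above) rewrites every match on a line.
    sed -E -i.bak 's=<DATA(202[^/]*) />=<DATA>\1</DATA>=g' "$f"
done

cat "$dir/sample.txt"
```

On real data, replace "$dir" with the folder holding the files, check a few results against the .bak copies, and delete the backups once satisfied.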

You could use pretty much the same regexes in perl too. The main difference is that, while \1 works in perl to refer to capture groups, it's better and more correct to use $1. Perl also has even more options for delimiting the s operator, e.g. matching pairs of { and }:

$ echo '<DATA2020-04-13T08:59:05.427 />' | perl -pe 's{<DATA(202[^/]*) />} {<DATA>$1</DATA>}'
<DATA>2020-04-13T08:59:05.427</DATA>

perl also has a /x modifier to ignore whitespace (including newlines) that isn't either escaped with \ or in a bracket expression. It ignores # comments too. The purpose is to make it easier to write more-readable, documented regexes in your code.

See man perlre for details on perl regular expressions.

Thank you, that worked perfectly! Problem solved. — user508787, Commented Jan 6, 2022 at 5:12

Ed Morton · Accepted Answer · 2022-01-06 13:00:30Z

Using a sed that supports -E for EREs (e.g. GNU or BSD sed):

$ sed -E 's:<(DATA)([^ ]+) />:<\1>\2</\1>:' file
<DATA>2020-04-13T08:59:05.427</DATA>

Otherwise, using any sed in any shell on every Unix box:

$ sed 's:<\(DATA\)\([^ ]*\) />:<\1>\2</\1>:' file
<DATA>2020-04-13T08:59:05.427</DATA>
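Both commands print the result to standard output and leave the file unchanged. Since -i is not part of POSIX sed, one portable way to update a folder of *.txt files is to write to a temporary file and move it over the original. The sketch below assumes that approach; the directory, file names and second timestamp are hypothetical demo data, and mktemp is used only to set up the demo.

```shell
#!/bin/sh
# Sketch: portable in-place edit without sed -i, via a temporary file.
# The directory and sample files are hypothetical, created only for this demo.
dir=$(mktemp -d)
printf '%s\n' 'a <DATA2020-04-13T08:59:05.427 /> b' > "$dir/one.txt"
printf '%s\n' 'c <DATA2021-11-30T23:01:44.002 /> d' > "$dir/two.txt"

for f in "$dir"/*.txt; do
    tmp="$f.tmp"
    # Replace the original only if sed exited successfully.
    sed 's:<\(DATA\)\([^ ]*\) />:<\1>\2</\1>:' "$f" > "$tmp" && mv "$tmp" "$f"
done

cat "$dir/one.txt" "$dir/two.txt"
```

Note that mv gives the file a new inode, so ownership and permissions may differ from the original; copying the temporary file back over the original with cp and then removing it avoids that.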

jubilatious1 · Accepted Answer · 2022-01-08 09:18:01Z

Using Raku (formerly known as Perl_6)

raku -pe 's:g/ \<DATA ( <+[0..9]+[-T:.]>+ ) \s\/\> /{"<DATA>"~$0~"</DATA>"}/;'

OR

raku -pe 's:g[ "<DATA" ( <+[0..9]+[-T:.]>+ ) " />" ] = ["<DATA>"~$0~"</DATA>"];'

Sample Input:

<DATA2020-04-13T08:59:05.427 />

Sample Output:

<DATA>2020-04-13T08:59:05.427</DATA>

Above are answers coded in Raku, a member of the Perl family of programming languages. Both examples above have four general characteristics of note:

No guessing on backslashed characters: if it isn't <alnum> (alphanumeric or underscore) it needs to be escaped,
Regex modifiers like /g for global now go at the head of the s/// form immediately after the s, preceded by a colon. And either s:global or s:g works,
Perl's /x modifier is now the default in Raku (allows liberal whitespace between regex atoms), and
String concatenation in Raku is accomplished with ~ tilde.

Both examples above use an enumerated character class <+[0..9]+[-T:.]>, which very simply consists of the digits [0..9], plus the four characters [-T:.]. Also, while the first example above follows the traditional s/// substitution idiom, the second example above uses Raku's new 'sans-backslash' replacement format (with an = equals sign in-between), which some readers may find to be more readable.

Finally, if you have any interest in DateTime extraction/modification, Raku has you covered:

~$ echo '<DATA2020-04-13T08:59:05.427 />' | raku -pe 's:g[ "<DATA" ( <+[0..9]+[-T:.]>+ ) " />" ] = [DateTime($0~"Z")];'
2020-04-13T08:59:05.427000Z
~$ echo '<DATA2020-04-13T08:59:05.427 />' | raku -pe 's:g[ "<DATA" ( <+[0..9]+[-T:.]>+ ) " />" ] = [DateTime(now) - DateTime($0~"Z")];'
54862286.622457

https://docs.raku.org/language/regexes#Enumerated_character_classes_and_ranges
https://docs.raku.org/routine/DateTime
https://raku.org
