0

I have a lot of text files with dates in XML formatted as follows:

<DATA2020-04-13T08:59:05.427 /> 

I need to change it into this:

<DATA>2020-04-13T08:59:05.427</DATA> 

Notes:

The date and time vary between strings and must not be altered. Before and after this string in each line there is a lot more XML-formatted content. Also, using Unix date is not an option; I really need to change the XML string inside the files.

I was thinking of using sed / awk / perl find and replace, maybe using wildcards. Can anyone please figure out a way to achieve this?

10
  • 1
    Please post what you've already tried yourself.
    – raspi
    Commented Jan 5, 2022 at 8:00
  • 3
    it looks like broken xml, which is probably why they want to fix it.
    – cas
    Commented Jan 5, 2022 at 8:28
  • 2
    @cas Well, it would be better to fix whatever created that XML instead of adding a post-processing step that may mess up other things.
    – Kusalananda
    Commented Jan 5, 2022 at 8:45
  • 2
    yes, it would. sometimes, though, it's faster and easier to just fix the broken output. especially if it took hours or days to generate or was generated from data that's no longer available. sometimes you just have to deal with the broken crap you have rather than the perfect data you wish you had.
    – cas
    Commented Jan 5, 2022 at 8:54
  • 2
    If it's a one-off, then sure, repair the data using sed or awk. If this is a regular data feed that's arriving on a daily basis, then fix the source to generate real XML rather than constructing a system that immortalises the bug (and breaks on the day that the bug is fixed).
    Commented Jan 5, 2022 at 10:56

3 Answers

1
$ echo '<DATA2020-04-13T08:59:05.427 />' | sed -E 's/<DATA(20[^/]*) \/>/<DATA>\1<\/DATA>/'
<DATA>2020-04-13T08:59:05.427</DATA>

Or, using = as the delimiter instead of /, to avoid having to backslash-escape the /s:

$ echo '<DATA2020-04-13T08:59:05.427 />' | sed -E 's=<DATA(202[^/]*) />=<DATA>\1</DATA>='
<DATA>2020-04-13T08:59:05.427</DATA>

This makes it a little easier to read (of course, you'd now have to escape any = characters in the search pattern and replacement text).
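
Since the question mentions many files, here is a minimal sketch of applying the same expression in place. The function name fix_data_tags and the .bak suffix are illustrative choices, not part of the answer above, and the trailing g flag is an addition that assumes a line may contain more than one broken tag. The -i.bak form (suffix attached directly to -i) is accepted by both GNU and BSD sed.

```shell
# fix_data_tags is a hypothetical helper name used only for this sketch.
# It rewrites every file given as an argument in place and keeps the
# original next to it as FILE.bak.
fix_data_tags() {
    # Assumption: every timestamp starts with "20", as in the answer above.
    # The g flag (an addition) handles several broken tags on one line.
    sed -i.bak -E 's=<DATA(20[^/]*) />=<DATA>\1</DATA>=g' "$@"
}
```

It could be called as fix_data_tags *.xml (the glob is only an example). Once the results look right, the .bak copies can be deleted.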


You could use pretty much the same regexes in perl too (the main difference being that while \1 works in perl to refer to capture groups, it's better and more correct to use $1), which has even more options for delimiting the s operator, e.g. matching pairs of { and }:

$ echo '<DATA2020-04-13T08:59:05.427 />' | perl -pe 's{<DATA(202[^/]*) />} {<DATA>$1</DATA>}'
<DATA>2020-04-13T08:59:05.427</DATA>

perl also has a /x modifier to ignore whitespace (including newlines) that isn't either escaped with \ or in a bracket expression. It ignores # comments too. The purpose is to make it easier to write more-readable, documented regexes in your code.

See man perlre for details on perl regular expressions.

1
  • Thank you that worked perfectly! Problem solved
    Commented Jan 6, 2022 at 5:12
0

Using a sed that supports -E for EREs (e.g. GNU or BSD sed):

$ sed -E 's:<(DATA)([^ ]+) />:<\1>\2</\1>:' file
<DATA>2020-04-13T08:59:05.427</DATA>

otherwise using any sed in any shell on every Unix box:

$ sed 's:<\(DATA\)\([^ ]*\) />:<\1>\2</\1>:' file
<DATA>2020-04-13T08:59:05.427</DATA>
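
Before touching a large set of files, it may help to preview what the expression would change. The following is a minimal sketch that assumes diff is available; the function name preview_fix is hypothetical and the g flag is an addition to the command above.

```shell
# preview_fix is a hypothetical helper name used only for this sketch.
# It prints the lines that would change in each file and leaves the
# files themselves untouched.
preview_fix() {
    for f in "$@"; do
        # diff exits non-zero when its inputs differ, so mask that status.
        sed 's:<\(DATA\)\([^ ]*\) />:<\1>\2</\1>:g' "$f" | diff "$f" - || true
    done
}
```

Note that this pattern matches any run of non-space characters after <DATA, so a different element whose name merely begins with DATA would also be rewritten. The preview is a cheap way to spot that kind of surprise.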
    0

    Using Raku (formerly known as Perl_6)

    raku -pe 's:g/ \<DATA ( <+[0..9]+[-T:.]>+ ) \s\/\> /{"<DATA>"~$0~"</DATA>"}/;' 

    OR

    raku -pe 's:g[ "<DATA" ( <+[0..9]+[-T:.]>+ ) " />" ] = ["<DATA>"~$0~"</DATA>"];' 

    Sample Input:

    <DATA2020-04-13T08:59:05.427 /> 

    Sample Output:

    <DATA>2020-04-13T08:59:05.427</DATA> 

    Above are answers coded in Raku, a member of the Perl family of programming languages. Both examples above have four general characteristics of note:

    1. No guessing on backslashed characters: if it isn't <alnum> (alphanumeric or underscore) it needs to be escaped,

    2. Regex modifiers like /g for global now go at the head of the s/// form immediately after the s, preceded by a colon. And either s:global or s:g works,

    3. Perl's /x modifier is now the default in Raku (allows liberal whitespace between regex atoms), and

    4. String concatenation in Raku is accomplished with ~ tilde.

    Both examples above use an enumerated character class <+[0..9]+[-T:.]>, which very simply consists of the digits [0..9], plus the four characters [-T:.]. Also, while the first example above follows the traditional s/// substitution idiom, the second example above uses Raku's new 'sans-backslash' replacement format (with an = equals sign in-between), which some readers may find to be more readable.

    Finally, if you have any interest in DateTime extraction/modification, Raku has you covered:

    ~$ echo '<DATA2020-04-13T08:59:05.427 />' | raku -pe 's:g[ "<DATA" ( <+[0..9]+[-T:.]>+ ) " />" ] = [DateTime($0~"Z")];'
    2020-04-13T08:59:05.427000Z
    ~$ echo '<DATA2020-04-13T08:59:05.427 />' | raku -pe 's:g[ "<DATA" ( <+[0..9]+[-T:.]>+ ) " />" ] = [DateTime(now) - DateTime($0~"Z")];'
    54862286.622457

    https://docs.raku.org/language/regexes#Enumerated_character_classes_and_ranges
    https://docs.raku.org/routine/DateTime
    https://raku.org
