5
\$\begingroup\$

I have written a Regex to validate XML/HTML, along with any attributes. It aims to:

  • Match any XML-like text
  • Not match any unclosed tags
  • Adapt for any spacing, newlines, etc.
  • Be as generous as possible: match as much as possible.
  • Match self-closing tags (in edit)

I initially created this to validate API requests I was making, to make sure each request completed successfully without missing bytes. I am aware this is a ridiculously stupid and ineffective method of doing this, but it's more of an experiment with Regex. I'm not interested in being told to use a different method; that's not what I'm aiming to do. However, now I'm wondering if:

  • There is any smart way to make it shorter
  • Known bug: you have to escape both types of speech mark all the time, i.e. placeholder="i 'love' regex" will not be matched; you have to write placeholder="i \'love\' regex", despite the fact that it's a ' inside a ". Fixed with a conditional operator, which I discovered by accident in the RegExr cheatsheet.
    • Known bug with this fix: it seems to work with "'", but not with '"'. Even weirder, it sometimes fails when the speech marks inside a tag are of different types: e.g. <input placeholder="wait, 'no'?" /> works fine, but <input type='text' placeholder="wait, 'no'?" /> (note the use of ' the first time) doesn't.
  • There is some easy way to allow for implicitly closed tags, e.g. <input> without having to write a closing tag: <input type="text"></input> (currently matched) vs <input type="text" /> (previously not matched by this Regex). Done! See revision history for the previous version without this feature.

I'm very new to Regex, and especially to ?R recursion, so I'm not sure how to go about most of these points. Of course, other general comments are also appreciated. Note that this will not run in normal JS browser Regex, so try either with the regex (not re) module in Python, or using the PCRE flavour on RegExr (link goes to test for my Regex).

Anyway, here's the current Regex:

<([\w!:[\]-]+)( +[\w!:[\]-]+( ?= ?((\d+)|(((')|")([ !#-&(-~]|(?(?=\8)"|')|\\\7)*([ !#-&(-[\]-~]|(?(?=\8)"|')|\\\7)\7)))?)*((>([^<])*((?>([^<])*(?R))*)([^<])*<\/\1>)|(\s*\/\s*>)) 

Or, formatted:

    <                         # Open the tag
    ([\w!:[\]-]+)             # Match the tag name, save to capture group
    (                         # Define group to match attributes
      \ +                     # Match any amount of whitespace, between the tag name and attribute name
      [\w!:[\]-]+             # Match the attribute name
      (                       # Define group to match attribute value
        \ ?                   # Match any amount of whitespace, between the attribute name and equals sign
        =                     # Match the equals
        \ ?                   # Match any amount of whitespace, between the equals sign and the attribute value
        (                     # Define a group for the attribute value
          (\d+)               # Match a number...
          |                   # ... or ...
          (                   # ... match a string
            (                 # Match speech mark, and record speech mark type in capture group
              (")             # Specifically record if it is a double quote for later
              |               # but also match (and save)
              '               # single quotes.
            )
            (                 # Define a group to match string contents
              [ !#-&(-~]      # Match any character, excluding all quotes
              |               # ... or ...
              (?              # define a conditional to...
                (?=           # check if...
                  \8          # the double quote group is empty, i.e it is a not a double quote, i.e it is a single quote
                )
                "             # if True, allow for inclusion of double quotes in the string (since 'text"text' is fine, but 'text'text' is not)
                |             # if False,
                '             # allow for inclusion of single quotes in the string (since "text'text" is fine, but "text"text" is not)
              )
              |               # ... or ...
              \\              # Match an _escaped_...
              \7              # speech mark
            )
            *                 # Match as many as possible
            (                 # Create a group to validate last character, to avoid it being a backslash (see (1) below)
              [ !#-&(-[\]-~]  # Match any character, _except_ backslash
              |               # ... or ...
              (?              # define a conditional to...
                (?=           # check if...
                  \8          # the double quote group is empty, i.e it is a not a double quote, i.e it is a single quote
                )
                "             # if True, allow for inclusion of double quotes in the string (since 'text"text' is fine, but 'text'text' is not)
                |             # if False,
                '             # allow for inclusion of single quotes in the string (since "text'text" is fine, but "text"text" is not)
              )
              |               # ... or ...
              \\              # Match an _escaped_...
              \7              # speech mark
            )
            \7                # Match closing speech mark
          )
        )
      )                       # End attribute value group
      ?                       # Make attribute value optional, to allow for boolean attributed - e.g `<input type="text" disabled></input>`
    )
    *                         # Allow for as many attributes as possible
    (                         # Set for choosing between two possible endings for the name/attributes, which currently looks like '<tagname attr="text"' - either, '>texthere</tagname>', or simply '/>'
      (                       # First possible ending, with text/nesting inside tag, and closing tag at end
        >                     # End opening tag
        ([^<])                # Allow for text in tag, before other nested tags
        *
        (                     # Open recursion contents group
          (                   # Define atomic recursion group
            ?>                # Make group atomic
            ([^<])
            *                 # Allow as much text as required _between_ tags (e.g allow <a>some text <span>some more text</span>__This line allows for this text here!__<span>other text</span>also text</a>)
            (?R)              # Call recursion group
          )
          *                   # Allow recursion group to loop as many times as required
        )
        ([^<])                # Allow for text in tag, after all other nested tags but before closing tag
        *
        <                     # Open closing tag
        \/                    # Match '/' character
        \1                    # Refer back to tag name
        >                     # End closing tag
      )
      |                       # Or,
      (                       # second possible ending, with a self-closing tag - simply match '/>'
        \s                    # allow for spaces before '/'
        *
        \/                    # Match '/'
        \s                    # Allow for spaces after '/', but before '>'
        *
        >                     # Match '>'
      )
    )

(1): I'd also like to optimise the first string-value capture group to automatically not match backslashes on the last character, but I can't figure out a way without repeating the entire group just without the backslash character.
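
One commonly used way to avoid repeating the group for the last character is to let a backslash only ever appear as the first half of an escape pair, so the closing quote can never be "eaten" by a trailing backslash. The sketch below is not the regex above: it is a hypothetical, simplified pattern for the quoted-value part only, written for Python's standard `re` module, and it accepts any character rather than the printable-ASCII ranges used in the original. The names `QUOTED` and `is_quoted_value` are made up for illustration.

```python
import re

# Hypothetical stand-in for the quoted attribute value only (an assumption,
# not the original pattern):
#   (["'])        capture whichever quote opens the string
#   (?!\1)[^\\]   one character that is neither that same quote nor a backslash
#   \\.           or a backslash followed by any single character (an escape pair)
#   \1            the same kind of quote closes the string
QUOTED = re.compile(r"""(["'])(?:(?!\1)[^\\]|\\.)*\1""")


def is_quoted_value(text):
    """Return True if the whole of text is one quoted string."""
    return QUOTED.fullmatch(text) is not None


print(is_quoted_value('"i \'love\' regex"'))  # other quote type needs no escaping
print(is_quoted_value("'\"'"))                # a double quote inside single quotes
print(is_quoted_value(r'"a\"b"'))             # escaped quote of the same type
print(is_quoted_value('"abc\\"'))             # trailing backslash escapes the closer
```

Because the negative lookahead `(?!\1)` refers to the captured opening quote, the same alternative covers both quote styles and no conditional is needed for this part.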

Any help or questions are appreciated. Like I said, I'm quite new to Regexes, so if there are some unspoken rules/standards I've missed, I'd be happy to hear them!

\$\endgroup\$
6
  • 4
    \$\begingroup\$See RegEx match open tags and also html.parser\$\endgroup\$
    – AJNeufeld
    Commented Aug 23, 2019 at 19:19
  • \$\begingroup\$@AJNeufeld I realise this isn't the most efficient method; in my original implementation I did in fact use html.parser with an array to track opening and closing tags. However, before I realised the existence of the module, I tried with Regex, so I'm wondering, assuming the module didn't exist, how well this regex would hold up.\$\endgroup\$ Commented Aug 23, 2019 at 19:27
  • 2
    \$\begingroup\$Well, as you noticed with self-closing tags, it doesn't hold up. Beyond that, <![CDATA[...]]> would likely bring it to its knees.\$\endgroup\$
    – AJNeufeld
    Commented Aug 23, 2019 at 19:34
  • \$\begingroup\$@AJNeufeld I've improved it now to allow for self-closing tags, but I don't understand what you mean by the <![CDATA[...]]> example. As long as the tag is closed, it should work fine with whatever you put in it.\$\endgroup\$ Commented Aug 24, 2019 at 14:39
  • 1
    \$\begingroup\$@ggorlen, regexes are not regular expressions. Exceedingly few regex libraries are limited to matching regular languages. The proof of impossibility you reference is emphatically not for a regex with explicit recursion capabilities.\$\endgroup\$ Commented Aug 24, 2019 at 16:45
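
The html.parser approach mentioned in the comments (a list used as a stack of open tags) might look roughly like the sketch below. It is an illustration under assumptions, not the asker's original implementation: the class name `TagBalanceChecker`, the helper `is_balanced`, and the small `VOID_TAGS` set are all made up here, and the set covers only a handful of HTML void elements.

```python
from html.parser import HTMLParser

# A few HTML void elements that never take a closing tag (illustrative subset).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagBalanceChecker(HTMLParser):
    """Track open tags on a stack and flag any mismatched closing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        # Void elements are complete on their own, so they are never pushed.
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # A closing tag must match the most recently opened tag.
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def is_balanced(markup):
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    # Anything left on the stack was opened but never closed.
    return checker.balanced and not checker.stack


print(is_balanced("<a>some text <span>more</span> also text</a>"))
print(is_balanced("<a><b></a>"))
```

Self-closing syntax such as `<span/>` is reported by `HTMLParser` as a start tag followed by an end tag, so it pushes and pops in one step.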

1 Answer

3
\$\begingroup\$

It is simply not feasible to just parse the full set of (X|HT)ML with regex. I can't provide too much feedback on your solution as it simply isn't the correct solution for the problem, but I can provide plenty of examples where it fails to match valid input.

  1. <tag attr='"'></tag>
  2. <tag attr=""></tag>
  3. <tag attr=attr></tag> - fine in HTML
  4. <tag><!-- html comment --></tag>
  5. <tag> </tag >
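
For comparison, here is a small sketch of how a real XML parser treats the five inputs above, using Python's standard `xml.etree.ElementTree`. It only checks XML well-formedness, so case 3 (valid HTML, but not XML) is expected to be rejected; the helper name `is_well_formed_xml` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The five inputs from the list above, in the same order.
CASES = [
    """<tag attr='"'></tag>""",
    '<tag attr=""></tag>',
    "<tag attr=attr></tag>",
    "<tag><!-- html comment --></tag>",
    "<tag> </tag >",
]


def is_well_formed_xml(markup):
    """Return True if markup parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


results = [is_well_formed_xml(case) for case in CASES]
for case, ok in zip(CASES, results):
    print(f"{case!r}: {'well-formed' if ok else 'not well-formed'} XML")
```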

You also need to decide - XML or HTML? Each brings its own set of problems. How should this be parsed? <tag><![CDATA[</tag>]]></tag>. In XML, this is the equivalent of <tag>&lt;/tag&gt;</tag>. In HTML, <![CDATA[ has no special meaning, and thus the outcome is different.
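
The XML half of that claim can be checked with Python's standard `xml.etree.ElementTree`; this is only a sketch of the XML behaviour and says nothing about how an HTML parser would treat the same input.

```python
import xml.etree.ElementTree as ET

# In XML, a CDATA section is just another way of writing character data,
# so both documents yield the same text content for <tag>.
cdata_form = ET.fromstring("<tag><![CDATA[</tag>]]></tag>")
escaped_form = ET.fromstring("<tag>&lt;/tag&gt;</tag>")

print(cdata_form.text)
print(escaped_form.text)
print(cdata_form.text == escaped_form.text)
```

A regex that treats every `<` as the start of a tag has no way to know that the `</tag>` inside the CDATA section is text rather than markup.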


While it is possible to write a full XML/HTML parser with regex (the grammar hierarchies argument applies only if your regex is limited to a regular language, and most aren't), it really isn't a good idea. If you want to see an attempt which covers far more than yours, take a look at this SO answer.

\$\endgroup\$
3
  • \$\begingroup\$OK, I've added an edit to fix cases 2, 4 and 5. I'll do case 3 in a moment, when I can figure out the best way (having to keep changing the reference numbers to capture groups by hand is a pain). I'd appreciate any help with no. 1 - while I have noticed it, I can't figure out why it's not working.\$\endgroup\$ Commented Aug 24, 2019 at 17:36
  • 1
    \$\begingroup\$@Geza, please see the meta site for acceptable ways to post improvements. I've rolled back the changes you made after this answer was posted.\$\endgroup\$ Commented Aug 24, 2019 at 18:49
  • \$\begingroup\$@PeterTaylor OK, see the edit I've just made.\$\endgroup\$ Commented Aug 24, 2019 at 19:03
