I have written a Regex to validate XML/HTML, along with any attributes. It aims to:
- Match any XML-like text
- Not match any unclosed tags
- Adapt for any spacing, newlines, etc.
- Be as generous as possible: match as much as possible.
- Match self-closing tags (in edit)
I initially created this to validate API requests I was making, to make sure each request completed successfully without missing bytes. I am aware this is a ridiculously stupid and ineffective way to do this, but it's more of an experiment with Regex. I'm not interested in being told to use a different method; that's not what I'm aiming to do. However, now I'm wondering if:
- There is any smart way to make it shorter
- Known bug: you have to escape both types of speech mark all the time, i.e. `placeholder="i 'love' regex"` will not be matched; you have to do `placeholder="i \'love\' regex"`, despite the fact that it's a `'` inside a `"`. Fixed with a conditional operator, which I discovered by accident in the Regexr cheatsheet.
  - Known bug with this fix: seems to work with `"'"`, but not with `'"'`. Even weirder, sometimes it fails when the speech marks inside a tag are of different types: e.g. `<input placeholder="wait, 'no'?" />` works fine, but `<input type='text' placeholder="wait, 'no'?" />` (note the use of `'` the first time) doesn't.
- There is some easy way to allow for implicitly closed tags, e.g. `<input>` without having to do a closing tag - `<input type="text"></input>` (is currently matched) vs `<input type="text" />` (not matched by this Regex). Done! See revision history for previous version without this feature.
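
To make the intended quoting rule concrete (a `'` is legal inside `"..."` and a `"` is legal inside `'...'`), here is a minimal sketch using Python's standard `re` module. It is a simplified stand-in, not the regex from this post: the pattern and the `quoted_values` helper are hypothetical names for illustration, and escaped quotes are deliberately left out.

```python
import re

# Hypothetical simplified pattern, NOT the regex from this post.
# Group "q" remembers which quote opened the string; (?!(?P=q)) then rejects
# only that same quote, so the other quote type may appear unescaped.
QUOTED = re.compile(r"""(?P<q>["'])(?P<body>(?:(?!(?P=q)).)*)(?P=q)""")


def quoted_values(text):
    """Return the contents of every quoted string found in text."""
    return [m.group("body") for m in QUOTED.finditer(text)]


# The mixed-quote case described in the known bug above.
print(quoted_values("""<input type='text' placeholder="wait, 'no'?" />"""))
```

Because the check is expressed relative to the remembered opening quote, both orders of mixed quotes go through the same branch, which may be easier to reason about than a separate conditional per quote type.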
I'm very new to Regex, and especially to `?R` recursion, so I'm not sure how to go about most of these points. Of course, other general comments are also appreciated. Note that this will not run in normal JS browser Regex, so try either with the `regex` (not `re`) module in Python, or using the PCRE flavour on RegExr (link goes to test for my Regex).
Anyway, here's the current Regex:
<([\w!:[\]-]+)( +[\w!:[\]-]+( ?= ?((\d+)|(((')|")([ !#-&(-~]|(?(?=\8)"|')|\\\7)*([ !#-&(-[\]-~]|(?(?=\8)"|')|\\\7)\7)))?)*((>([^<])*((?>([^<])*(?R))*)([^<])*<\/\1>)|(\s*\/\s*>))
Or, formatted:

    <  # Open the tag
    ([\w!:[\]-]+)  # Match the tag name, save to capture group
    (  # Define group to match attributes
      \ +  # Match any amount of whitespace, between the tag name and attribute name
      [\w!:[\]-]+  # Match the attribute name
      (  # Define group to match attribute value
        \ ?  # Match any amount of whitespace, between the attribute name and equals sign
        =  # Match the equals
        \ ?  # Match any amount of whitespace, between the equals sign and the attribute value
        (  # Define a group for the attribute value
          (\d+)  # Match a number...
          |  # ... or ...
          (  # ... match a string
            (  # Match speech mark, and record speech mark type in capture group
              (")  # Specifically record if it is a double quote for later
              |  # but also match (and save)
              '  # single quotes.
            )
            (  # Define a group to match string contents
              [ !#-&(-~]  # Match any character, excluding all quotes
              |  # ... or ...
              (?  # define a conditional to...
                (?=  # check if...
                  \8  # the double quote group is empty, i.e. it is not a double quote, i.e. it is a single quote
                )
                "  # if True, allow for inclusion of double quotes in the string (since 'text"text' is fine, but 'text'text' is not)
                |  # if False,
                '  # allow for inclusion of single quotes in the string (since "text'text" is fine, but "text"text" is not)
              )
              |  # ... or ...
              \\  # Match an _escaped_...
              \7  # speech mark
            )
            *  # Match as many as possible
            (  # Create a group to validate last character, to avoid it being a backslash (see (1) below)
              [ !#-&(-[\]-~]  # Match any character, _except_ backslash
              |  # ... or ...
              (?  # define a conditional to...
                (?=  # check if...
                  \8  # the double quote group is empty, i.e. it is not a double quote, i.e. it is a single quote
                )
                "  # if True, allow for inclusion of double quotes in the string (since 'text"text' is fine, but 'text'text' is not)
                |  # if False,
                '  # allow for inclusion of single quotes in the string (since "text'text" is fine, but "text"text" is not)
              )
              |  # ... or ...
              \\  # Match an _escaped_...
              \7  # speech mark
            )
            \7  # Match closing speech mark
          )
        )
      )  # End attribute value group
      ?  # Make attribute value optional, to allow for boolean attributes - e.g. `<input type="text" disabled></input>`
    )
    *  # Allow for as many attributes as possible
    (  # Set for choosing between two possible endings for the name/attributes, which currently looks like '<tagname attr="text"' - either, '>texthere</tagname>', or simply '/>'
      (  # First possible ending, with text/nesting inside tag, and closing tag at end
        >  # End opening tag
        ([^<])  # Allow for text in tag, before other nested tags
        *
        (  # Open recursion contents group
          (  # Define atomic recursion group
            ?>  # Make group atomic
            ([^<])
            *  # Allow as much text as required _between_ tags (e.g. allow <a>some text <span>some more text</span>__This line allows for this text here!__<span>other text</span>also text</a>)
            (?R)  # Call recursion group
          )
          *  # Allow recursion group to loop as many times as required
        )
        ([^<])  # Allow for text in tag, after all other nested tags but before closing tag
        *
        <  # Open closing tag
        \/  # Match '/' character
        \1  # Refer back to tag name
        >  # End closing tag
      )
      |  # Or,
      (  # second possible ending, with a self-closing tag - simply match '/>'
        \s  # allow for spaces before '/'
        *
        \/  # Match '/'
        \s  # Allow for spaces after '/', but before '>'
        *
        >  # Match '>'
      )
    )

(1): I'd also like to optimise the first string-value capture group to automatically not match backslashes on the last character, but I can't figure out a way without repeating the entire group just without the backslash character.
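
For point (1), one common idiom is to let a backslash always consume the character after it, so a string can never end on a dangling backslash and no second copy of the character group is needed. The sketch below uses Python's standard `re` module and covers only the string part; `STRING` and `is_valid_string` are hypothetical names, and this is a stand-in under those assumptions rather than a drop-in replacement for the group above.

```python
import re

# Hypothetical stand-in for the string part only.
# Inside the loop there are two alternatives:
#   \\.               a backslash together with whatever character follows it
#   (?!(?P=q))[^\\]   one character that is neither the opening quote nor a backslash
# A backslash can therefore never be the last character before the closing quote.
STRING = re.compile(r"""(?P<q>["'])(?:\\.|(?!(?P=q))[^\\])*(?P=q)""")


def is_valid_string(s):
    """True if s is exactly one well-formed quoted string."""
    return STRING.fullmatch(s) is not None


print(is_valid_string(r'"a \" b"'))  # escaped quote inside the string
print(is_valid_string(r'"abc\"'))    # backslash swallows the closing quote
```

The design choice here is that the "no trailing backslash" rule falls out of the alternation itself instead of being enforced by a separate last-character group.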
Any help or questions are appreciated. Like I said, I'm quite new to Regexes, so if there are some unspoken rules/standards I've missed, I'd be happy to hear them!
`html.parser` in an array to track opening and closing tags. However, before I realised the existence of the module, I tried with Regex, so I'm wondering, assuming the module didn't exist, how well this regex would hold up.

`<![CDATA[...]]>` would likely bring it to its knees.

`<![CDATA[...]]>` example. As long as the tag is closed, it should work fine with whatever you put in it.
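
For comparison only, the `html.parser` approach mentioned in the comments (tracking opening and closing tags in an array) might look roughly like the sketch below. `TagBalanceChecker` and `is_balanced` are hypothetical names, and the strictness rules are assumptions: every opening tag needs a matching closing tag in the right order, and only `<tag />` counts as self-closing.

```python
from html.parser import HTMLParser


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: keeps currently open tags on a stack (a plain list)."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <input /> open and close at once,
        # so they never touch the stack.
        pass

    def handle_endtag(self, tag):
        # A closing tag must match the most recently opened tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.balanced = False


def is_balanced(markup):
    """True if every opened tag is closed, in the correct order."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    return checker.balanced and not checker.stack


print(is_balanced('<a>some text <span>more</span> also text</a>'))
print(is_balanced('<a><span>unclosed</a>'))
```

Note that this sketch, like the regex, treats a bare `<input>` as unclosed, since it has no list of HTML void elements.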