Normalizing strings using PHP preg_replace and regex expressions

Question

I needed a way to normalize nearly any given input for a configuration parameter on a dashboard. The server-side code is in PHP, so I wrote a test script to experiment with preg_replace and regex. By "normalize" I mean that the resulting value must meet the following restrictions:

Take any cased value and return all lower case. For example, given "ValiDVoicE", after normalizing, "validvoice" would be returned.
Strip away all special characters and white space except underscores. For example, given "@@Valid _voice" return only "valid_voice".
Trim all nonessential white space from front and end of string. For example, given " (&& (*Paxus-Demo " return "paxus_demo".

Solution: (I copied my test script below)

$myvar = [
    "Paxus-Demo", "Paxus Demo", "paxus_demo", "Paxus Demo",
    "paxus-Demo", "paxus_ Demo", "Paxus _Demo", "*(&*& (*Paxus-Demo ",
    "@@Valid _voice", "Valid-Voice", "PortaL-Demo", "gui_demo",
    "Gui-Demo ", "VoiceInstance", "vAlid Voice", " vaLid_ _voiCe ",
];

for( $i = 0; $i < sizeof( $myvar ); ++$i ) {
    $modded = trim(
        strtolower(
            preg_replace(
                array( '/[^a-zA-Z0-9 ]/i', '<\W+>' ),
                array( ' ', '_' ),
                $myvar[$i]
            )
        ),
        '_'
    );
    echo "modded2 = " . $modded." [ ". $i ." ] = " .$myvar[$i]. "<br />";
}

My question is not so much about elegance: could I have done all of this in a single regular expression using lookarounds? Bear in mind that I never took the time to truly understand regex before today, so my knowledge of lookarounds is still a bit wonky. That said, if someone can explain the use of lookarounds in simple terms, it would be greatly appreciated.

Can you tell us what the purpose of this exercise is? To what end do you do this normalization? Are you trying to recognize what's being entered? Is it for a search engine? Etc. If you give us just a bit of a wider view of what you're trying to accomplish we can be a lot more helpful and creative. Only nerds like to struggle with your regular expressions! :-) — KIKO Software, Commented Mar 26, 2015 at 21:24
It's for a field on the configuration side of a dashboard that drives dynamic screen-sets. The field in question is a new field that's being added and will store an Installation ID, which will then be used along with the generated sessions so that multiple instances of the dashboard can be run with the same user, but with different session+instance_id combinations. The reason for multiple instances is so that we can unify the different instances of the dashboard that get installed at customer locations instead of having multiple copies of the core database on our db server. — David Webb, Commented Mar 27, 2015 at 15:57
The reason for normalizing is so the installation IDs have a standardized format and can be easily identified. I hope this makes sense. — David Webb, Commented Mar 27, 2015 at 15:58

Misunderstood · Accepted Answer · 2015-03-27 21:32:33Z

At first glance, I would trim first (without a mask) and then apply strtolower(), both before the preg_replace.

$modded = preg_replace(array( '/[^a-z0-9 ]/','<\W+>'),array( ' ', '_' ), strtolower(trim( $myvar[$i])));

Trim first: there are fewer characters to process when there is white space to trim.
Then strtolower(), which allows '/[^a-zA-Z0-9 ]/i' to be simplified to '/[^a-z0-9 ]/'.

Case-insensitive matching requires extra CPU cycles.
RegEx is usually the least efficient string function, especially compared to strtolower() and trim().
Using these two functions before the RegEx gives the RegEx less work to do.
Consider giving trim() additional mask characters, e.g.
space, new line, carriage return, asterisk, ampersand, at sign, parentheses,
plus any other characters that might need to be trimmed (quote marks?).

This worked well for your test $myvar:

trim($str," \t\n\r\0\x0B\x28\x29\x26\x2a\x40\x5f")
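The hex escapes in that mask are simply the ASCII codes for the punctuation listed above. A minimal sketch that spells them out, assuming a helper named `trimNoise` (a made-up name, not part of the original script):

```php
<?php
// Hypothetical helper: trim whitespace plus the punctuation seen in the test data.
function trimNoise(string $str): string
{
    // trim()'s default characters, followed by ( ) & * @ _
    $mask = " \t\n\r\0\x0B" . '()&*@_';
    return trim($str, $mask);
}

// The literal characters and the hex escapes describe the same mask.
var_dump("\x28\x29\x26\x2a\x40\x5f" === '()&*@_'); // bool(true)

echo trimNoise("*(&*& (*Paxus-Demo "), "\n"; // Paxus-Demo
echo trimNoise(" vaLid_ _voiCe "), "\n";     // vaLid_ _voiCe
```

Note that trim() only touches the two ends of the string, so anything in the middle is still left for the regex.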

RegEx is not my strong suit. While it works well this does not make sense to me. '/[^a-z0-9 ]/','<\W+>'
I'm not saying it is wrong, it's not broke, I just do not fully understand.
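For readers with the same question, here is a rough sketch of what the two patterns do when run one after the other. PHP accepts bracket-style delimiters, so '<\W+>' is the pattern \W+ delimited by < and >, equivalent to '/\W+/'. The function name `twoStep` is invented for illustration.

```php
<?php
// Hypothetical helper that returns the intermediate and the final string.
function twoStep(string $input): array
{
    // Step 1: anything that is not a lowercase letter, digit or space becomes a space.
    $spaced = preg_replace('/[^a-z0-9 ]/', ' ', $input);
    // Step 2: each run of non-word characters (only spaces remain) becomes one underscore.
    $joined = preg_replace('<\W+>', '_', $spaced);
    return [$spaced, $joined];
}

[$spaced, $joined] = twoStep('valid_ _voice');
echo "[$spaced] -> [$joined]\n"; // [valid   voice] -> [valid_voice]
```

The first pattern is what turns existing underscores and hyphens into spaces, which is why the second pattern can then collapse each mixed run into a single underscore.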

Some minor stuff that bugs me:
You do not need to concatenate simple variables in double quotes:

echo "modded2 = " . $modded." [ ". $i ." ] = " .$myvar[$i]. "\n";

This is the same but cleaner:

echo "modded2 = $modded [$i] = $myvar[$i]\n";
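A small sketch of the rule behind this: a plain variable, or a single array element with a simple index, interpolates directly inside double quotes, while anything more complex needs curly braces. The sample values below are made up.

```php
<?php
$myvar  = ['Paxus-Demo', 'Valid-Voice'];
$modded = 'paxus_demo';
$i      = 0;

// Concatenation and simple interpolation build the same string.
$concat = "modded2 = " . $modded . " [" . $i . "] = " . $myvar[$i] . "\n";
$interp = "modded2 = $modded [$i] = $myvar[$i]\n";
var_dump($concat === $interp); // bool(true)

// Curly braces are needed for expressions such as nested array access.
$rows = [['name' => 'gui_demo']];
echo "first row: {$rows[0]['name']}\n"; // first row: gui_demo
```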

This may be more efficient, (or may not, did not benchmark):

for( $i = 0; $i < sizeof( $myvar ); ++$i )

But this is cleaner, easier to see what is going on.

foreach($myvar as $key => $value){

My preference is count over sizeof, although they are the same.
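To make the comparison concrete, here is a sketch of both loop styles over a shortened, made-up version of the test array. sizeof() is an alias of count(), so the choice between them is purely about readability.

```php
<?php
$myvar = ['Paxus-Demo', 'gui_demo', 'VoiceInstance'];

// Index-based loop: count() is evaluated once up front rather than on every pass.
$viaFor = [];
for ($i = 0, $n = count($myvar); $i < $n; ++$i) {
    $viaFor[] = "$i: $myvar[$i]";
}

// foreach hands over the key and the value directly.
$viaForeach = [];
foreach ($myvar as $key => $value) {
    $viaForeach[] = "$key: $value";
}

var_dump($viaFor === $viaForeach); // bool(true)
```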

In a test environment like this, instead of <br> I use "\n" with a plain text header:

header('Content-Type: text/plain; charset=utf-8');

It works well with print_r(), var_export(), var_dump(), etc.

My Tested Code:

header('Content-Type: text/plain; charset=utf-8');

$myvar = array(
    "Paxus-Demo", "Paxus Demo", "paxus_demo", "Paxus Demo",
    "paxus-Demo", "paxus_ Demo", "Paxus _Demo", "*(&*& (*Paxus-Demo ",
    "@@Valid _voice", "Valid-Voice", "PortaL-Demo", "gui_demo",
    "Gui-Demo ", "VoiceInstance", "vAlid Voice", " vaLid_ _voiCe "
);

foreach ($myvar as $k => $before) {
    $after = preg_replace(
        array('/[^a-z0-9 ]/', '<\W+>'),
        array(' ', '_'),
        strtolower(trim($before, " \t\n\r\0\x0B\x28\x29\x26\x2a\x40\x5f"))
    );
    // echo "$k. $before => $after\n";
    echo "$k. $after <= \"$before\"\n";
}

The Result:

0. paxus_demo <= "Paxus-Demo"
1. paxus_demo <= "Paxus Demo"
2. paxus_demo <= "paxus_demo"
3. paxus_demo <= "Paxus Demo"
4. paxus_demo <= "paxus-Demo"
5. paxus_demo <= "paxus_ Demo"
6. paxus_demo <= "Paxus _Demo"
7. paxus_demo <= "*(&*& (*Paxus-Demo "
8. valid_voice <= "@@Valid _voice"
9. valid_voice <= "Valid-Voice"
10. portal_demo <= "PortaL-Demo"
11. gui_demo <= "gui_demo"
12. gui_demo <= "Gui-Demo "
13. voiceinstance <= "VoiceInstance"
14. valid_voice <= "vAlid Voice"
15. valid_voice <= " vaLid_ _voiCe "

Thanks for the comment. First off, as for "You do not need to concatenate simple variables in double quotes": I am aware of this; it's just a habit I have, and this is a test script I wrote to check the validity of the normalization. The echo line is only there to show the result, obviously. That said, I appreciate the insight given. Also, FYI, the for loop is only there to test and echo the multiple scenarios. The string manipulation is being added to a function that will be called whenever any type of normalization needs to be done for any field input. — David Webb, Commented Mar 27, 2015 at 20:11
Actually, what I am looking for is whether it's possible to move all of this array( '/[^a-zA-Z0-9 ]/i','<\W+>'), array( ' ', '_' ) into one regex expression, that is, without using arrays. — David Webb, Commented Mar 27, 2015 at 20:17
I do not know RegEx well enough. But I believe it is necessary for the patterns array('/[^a-zA-Z0-9 ]/i','<\W+>') to be executed in that order, with ' ' and '_' as their replacements. The question is: why? If it is to make it look cleaner, which is what I prefer, it could be written like this: $patterns = array('/[^a-zA-Z0-9 ]/i', '<\W+>') and $replacements = array(' ', '_'), using the classic preg_replace($patterns, $replacements, $string); — Misunderstood, Commented Mar 27, 2015 at 20:40
I would keep the trim() and especially the strtolower(), as they are probably much more efficient than RegEx, and use them before the RegEx. strtolower() will be cheaper than matching a-zA-Z or using /i, in the same way that strpos() is much more efficient than preg_match(). — Misunderstood, Commented Mar 27, 2015 at 20:50
The RegEx in this instance, when used with preg_replace, first replaces every character outside the [a-zA-Z0-9 ] range with a blank space, and then replaces each run of blank spaces with an underscore. — David Webb, Commented Mar 27, 2015 at 21:11

Casimir et Hippolyte · Accepted Answer · 2015-03-29 17:22:27Z

The main problem with your code is that you perform two replacements for each string. When you pass arrays as the pattern and replacement parameters to preg_replace, the whole string is processed once per array item. These two replacements are not needed.

You can avoid them if you replace all groups of characters that are not letters or digits with an underscore:

foreach($myvar as $item) {
    $res[] = strtolower( trim( preg_replace('/[\W_]+/', '_', $item ), '_' ) );
}

The result is about 2X faster.
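As a rough sketch, the single-pattern version can be wrapped in a function and checked against the two-pattern version from the question. The function names and the shortened sample list are assumptions for illustration. The timing loop is only indicative, and the numbers will vary by machine and PHP version.

```php
<?php
// Hypothetical wrapper around the single-pattern approach.
function normalizeOnePass(string $input): string
{
    // Collapse every run of non-word characters and underscores into one underscore.
    return strtolower(trim(preg_replace('/[\W_]+/', '_', $input), '_'));
}

// The two-pattern approach from the question, kept for comparison.
function normalizeTwoPass(string $input): string
{
    $patterns = ['/[^a-zA-Z0-9 ]/i', '<\W+>'];
    return trim(strtolower(preg_replace($patterns, [' ', '_'], $input)), '_');
}

$samples = ['Paxus-Demo', '*(&*& (*Paxus-Demo ', '@@Valid _voice', ' vaLid_ _voiCe '];

foreach ($samples as $sample) {
    // Both functions should agree on these ASCII inputs.
    printf("%-22s %s\n", '"' . $sample . '"', normalizeOnePass($sample));
    assert(normalizeOnePass($sample) === normalizeTwoPass($sample));
}

// Crude timing comparison; treat the output as a hint, not a benchmark.
foreach (['normalizeOnePass', 'normalizeTwoPass'] as $fn) {
    $start = microtime(true);
    for ($i = 0; $i < 20000; ++$i) {
        foreach ($samples as $sample) {
            $fn($sample);
        }
    }
    printf("%s: %.3f s\n", $fn, microtime(true) - $start);
}
```

Without the u modifier, \W works on bytes, so this sketch is only checked for ASCII input like the samples above.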

@Casimir... this is what I was looking for. Thanks a bunch. It just goes to show I still have much to learn about using regular expressions. — David Webb, Commented Mar 30, 2015 at 18:28
