4
\$\begingroup\$

I have a slice of strings, and each string contains multiple key=value formatted messages. I want to pull all the keys out of the strings so I can use them as the header for a CSV file. I do not know all the potential key fields in advance, so I have to use regular expression matching to find them.

Here is my code.

    package main

    import (
        "fmt"
        "regexp"
    )

    func GetKeys(logs []string) []string {
        // topMatches is the final array to be returned.
        // midMatches contains no duplicates, but the data is `key=`.
        // subMatches contains all initial matches.
        // initialRegex matches for anthing that matches `key=`. this is because the matching patterns.
        // cleanRegex massages `key=` to `key`
        topMatches := []string{}
        midMatches := []string{}
        subMatches := []string{}
        initialRegex := regexp.MustCompile(`([a-zA-Z]{1,}\=)`)
        cleanRegex := regexp.MustCompile(`([a-zA-Z]{1,})`)

        // the nested loop for matches is because FindAllString
        // returns []string
        for _, i := range logs {
            matches := initialRegex.FindAllString(i, -1)
            for _, m := range matches {
                subMatches = append(subMatches, m)
            }
        }

        // remove duplicates.
        seen := map[string]string{}
        for _, x := range subMatches {
            if _, ok := seen[x]; !ok {
                midMatches = append(midMatches, x)
                seen[x] = x
            }
        }

        // this is where I remove the `=` character.
        for _, y := range midMatches {
            clean := cleanRegex.FindAllString(y, 1)
            topMatches = append(topMatches, clean[0])
        }

        return topMatches
    }

    func main() {
        y := []string{"key=value", "msg=payload", "test=yay", "msg=payload"}
        y = GetKeys(y)
        fmt.Println(y)
    }

I think my code is inefficient because I cannot determine how to properly optimise the initialRegex regular expression to match just the key in the key=value format without matching the value as well.

Can my first regular expression, initialRegex, be optimised so I do not have to do a second matching loop to remove the = character?

Playground: http://play.golang.org/p/ONMf_cympM

\$\endgroup\$
1
  • \$\begingroup\${1,} is equivalent to +, and if the Go regexp package supports it, you can use positive look-ahead to detect but not capture the =: [a-zA-Z]+(?=\=). IIRC, the second = doesn't need to be escaped since it has no special meaning outside of this context. Finally, I doubt you need the capturing group around the whole expression.\$\endgroup\$ Commented Mar 6, 2016 at 20:51
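
On the "if the Go regexp package supports it" point in the comment above: Go's regexp package implements the RE2 syntax, which has no look-ahead, so the `(?=...)` form is rejected at compile time. A minimal sketch to check this, where `compiles` is a hypothetical helper written only for this illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles is a hypothetical helper for this sketch: it reports whether
// Go's regexp package accepts the given pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Look-ahead is not part of the RE2 syntax that Go implements,
	// so regexp.Compile returns an error for this pattern.
	fmt.Println(compiles(`[a-zA-Z]+(?=\=)`))

	// A capture group followed by a literal "=" is accepted.
	fmt.Println(compiles(`([a-zA-Z]+)=`))
}
```

Because of this, a capture group read through one of the Submatch methods is the usual way to match the `=` without keeping it.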

2 Answers 2

6
\$\begingroup\$

You're not making good use of regular expressions. A single regex can do the job:

pattern := regexp.MustCompile(`([a-zA-Z]+)=`) 

The parentheses (...) capture the interesting part for you.

You can use result := pattern.FindAllStringSubmatch(s, -1) to match a string against the regex pattern. The return value is a [][]string with one []string slice per match. In each slice, the 1st element is the entire matched string, and the 2nd, 3rd, ... elements hold the content of the capture groups. In this example we have one capture group (...), so the key will be in item[1] of each []string slice.

Instead of a map[string]string map for seen, a map[string]bool would be more efficient.

Putting it together:

    func GetKeys(logs []string) []string {
        var keys []string
        pattern := regexp.MustCompile(`([a-zA-Z]+)=`)
        seen := make(map[string]bool)
        for _, log := range logs {
            result := pattern.FindAllStringSubmatch(log, -1)
            for _, item := range result {
                key := item[1]
                if _, ok := seen[key]; !ok {
                    keys = append(keys, key)
                    seen[key] = true
                }
            }
        }
        return keys
    }

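If the input strings are not guaranteed to match the pattern, no extra guard is needed here: FindAllStringSubmatch returns nil when nothing matches, so the inner loop is simply skipped. A check like the following would only be needed with the single-match FindStringSubmatch, whose []string result is nil when there is no match:

    if len(result) != 2 { continue }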
\$\endgroup\$
8
  • \$\begingroup\$I was reading about submatches, but I wasn't entirely sure if that's what I wanted or not because I didn't understand that was for capture groups. That makes a lot of sense. What is more efficient about a boolean over the string?\$\endgroup\$
    – Sienna
    Commented Mar 4, 2016 at 20:32
  • \$\begingroup\$@mynameismevin the storage of a bool is smaller than a string\$\endgroup\$
    – janos
    Commented Mar 6, 2016 at 20:53
  • \$\begingroup\$The original code used FindAllString to find multiple matches per message, but yours uses FindStringSubmatch which looks to return a single match.\$\endgroup\$ Commented Mar 7, 2016 at 21:36
  • \$\begingroup\$A map[string]struct{} is even more efficient and uses less memory than map[string]bool.\$\endgroup\$
    – OneOfOne
    Commented Mar 28, 2016 at 8:38
  • \$\begingroup\$@OneOfOne really? how? and how would I change the line seen[key] = true to make that work?\$\endgroup\$
    – janos
    Commented Mar 28, 2016 at 10:13
3
\$\begingroup\$

I know this is an old question but it popped up in my feed so I figured I'd contribute.

Out of curiosity, why use a regular expression at all? You could achieve the same thing using the standard strings package and keep things simple. Here's a Playground that outputs the same result as your Playground.

    package main

    import (
        "fmt"
        "strings"
    )

    func GetKeys(logs []string) []string {
        exists := make(map[string]bool)
        keys := make([]string, 0)
        for _, log := range logs {
            parts := strings.Split(log, "=")
            if len(parts) >= 1 {
                k := parts[0]
                if !exists[k] {
                    keys = append(keys, k)
                    exists[k] = true
                }
            }
        }
        return keys
    }

    func main() {
        y := []string{"key=value", "msg=payload", "test=yay", "msg=payload"}
        fmt.Println(GetKeys(y))
    }
\$\endgroup\$
