3
\$\begingroup\$

I parse a big source code directory (100k files). I traverse every line in every file and look for function calls via regex matching.

I know that using regex to parse languages is a terrible idea. However, I'm looking for simplicity and not interested in the whole parse tree. Moreover, the regex pattern is generic and applies to all languages with the func(arg) syntax.

Following some profiling that I did, my hotspot is of course the function that responsible of processing a single line and search matches in that line. In big code projects, this function is called about 25 million times. I was wondering weather there is room for optimization in the following 9 lines, or is this as good as it gets:

def get_line_symbols(self, line): symbols = [] matches = self.index_pattern.finditer(line) for match in matches: while match: symbols.append(match.group(1)) symbol_args = match.group(2) match = self.index_pattern.search(symbol_args) return symbols 

self.index_pattern is re.compile('(\w+)(\([^\)]*\)?)')

\$\endgroup\$

    1 Answer 1

    3
    \$\begingroup\$

    I can improve the performance by 25% or so, by stopping the regular expression after the opening parenthesis, and so avoiding the need for a while loop or indeed a function at all:

    get_line_symbols = re.compile(r'(\w+)\(').findall 

    (Note that it's a good idea to use the r modifier on regular expression strings, to avoid the risk of accidentally writing something like '\b', which needs to be written as r'\b' or '\\b' if it's to mean a word boundary rather than the backspace character.)

    I get a further speedup by switching from the built-in re library to the third-party regex library. Using both of these improvements is about three times as fast as the code in the post.

    \$\endgroup\$
    4
    • \$\begingroup\$the while loop is used to handle a line with nested function calls, f(g(h)). In this case, you need to regex the arguments until there's no match.\$\endgroup\$
      – susdu
      CommentedJul 6, 2017 at 18:00
    • 2
      \$\begingroup\$@susdu Your original regex attempts to look for matching pairs of parentheses, but it actually doesn't. For f(g(h)), the first match would be f(g(h) — note that the closing parenthesis is dropped. Since regular expressions are no good at identifying nested matching pairs, you would be better off just looking for the opening parenthesis.\$\endgroup\$CommentedJul 6, 2017 at 18:07
    • 1
      \$\begingroup\$@200_success is quite right — the complicated regex in the question isn't finding any more matches than the simple regex in the answer. Try it out and see for yourself!\$\endgroup\$CommentedJul 6, 2017 at 20:22
    • \$\begingroup\$I guess I really over-complicated this, thanks guys\$\endgroup\$
      – susdu
      CommentedJul 6, 2017 at 21:41

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.