I parse a big source code directory (100k files). I traverse every line in every file and look for function calls via regex matching.
I know that using regex to parse languages is a terrible idea. However, I'm looking for simplicity and not interested in the whole parse tree. Moreover, the regex pattern is generic and applies to all languages with the func(arg) syntax.
After some profiling, my hotspot is, of course, the function responsible for processing a single line and searching for matches in it. In big code projects, this function is called about 25 million times. I was wondering whether there is room for optimization in the following 9 lines, or whether this is as good as it gets:
    def get_line_symbols(self, line):
        symbols = []
        matches = self.index_pattern.finditer(line)
        for match in matches:
            while match:
                symbols.append(match.group(1))
                symbol_args = match.group(2)
                match = self.index_pattern.search(symbol_args)
        return symbols
where self.index_pattern is re.compile('(\w+)(\([^\)]*\)?)')
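To make the behavior concrete, here is a minimal self-contained sketch of the same function. The wrapper class name SymbolIndexer and the sample line are hypothetical, made up only for illustration. The pattern is the same one, written as a raw string so the backslashes reach the regex engine unchanged.

```python
import re


class SymbolIndexer:
    """Hypothetical wrapper class, used only to make the snippet runnable."""

    def __init__(self):
        # Same pattern as above, as a raw string.
        # group(1): the called name; group(2): "(" plus everything up to
        # the first ")" (the closing paren itself is optional).
        self.index_pattern = re.compile(r'(\w+)(\([^\)]*\)?)')

    def get_line_symbols(self, line):
        symbols = []
        matches = self.index_pattern.finditer(line)
        for match in matches:
            # Descend into the argument text to pick up a nested call.
            while match:
                symbols.append(match.group(1))
                symbol_args = match.group(2)
                match = self.index_pattern.search(symbol_args)
        return symbols


if __name__ == "__main__":
    indexer = SymbolIndexer()
    # Made-up sample line with one nested call and one sibling call.
    print(indexer.get_line_symbols("foo(bar(x), baz(y))"))
```

On this sample the outer finditer yields foo(bar(x) and baz(y), and the inner search finds bar inside the first argument string, so the function returns foo, bar, and baz in that order.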