Can anyone help me optimise these three functions? I profiled and timed my original Python file and found that most of the calls and most of the running time came from these three functions.
The three functions are part of a text normaliser used for text processing. The full Python file is available if anyone wants to look at the whole script. Thank you.
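For reference, the kind of per-function call/time breakdown I mean can be produced with cProfile; here is a minimal sketch, where the profiled function and sample data are placeholders rather than anything from the real normaliser:

    import cProfile
    import pstats

    def strip_token(token):
        # Placeholder for one of the hot functions being profiled.
        return token.strip(',.;!?:"\'')

    def run():
        for _ in range(100000):
            strip_token('"hello,')

    # Dump the stats to a file, then print the ten most expensive calls.
    cProfile.run('run()', 'stats.out')
    pstats.Stats('stats.out').sort_stats('cumulative').print_stats(10)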
    def __rstrip(self, token):
        for i in range(5):
            if len(token):
                if token[-1] in [',', '.', ';', '!', '?', ':', '"']:
                    token = token[:-1]
                else:
                    break
        return token

    def __lstrip(self, token):
        for i in range(5):
            if len(token):
                if token[0] in [',', '.', ';', '!', '?', ':', '"', '\'']:
                    token = token[1:]
                else:
                    break
        return token

    def __generate_results(self, original, normalised):
        words = []
        for t in normalised:
            if len(t[0]):
                words.append(t[0])

        text = ' '.join(words)

        tokens = []
        if len(original):
            for t in original:
                idx = t[1]
                words = []
                for t2 in normalised:
                    if idx == t2[1]:
                        words.append(t2[0])

                display_text = self.__rstrip(t[0])
                display_text = self.__lstrip(display_text)

                tokens.append((t[0], words, display_text))
        else:
            tokens.append(('', '', ''))

        return text, tokens
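To show what the two strip helpers do in isolation, here is a standalone sketch (module-level copies without self; the behaviour is intended to match the methods above, but the function names are mine):

    # Each helper removes at most 5 punctuation characters from one end of the token.
    def rstrip_token(token):
        for _ in range(5):
            if token and token[-1] in ',.;!?:"':
                token = token[:-1]
            else:
                break
        return token

    def lstrip_token(token):
        for _ in range(5):
            if token and token[0] in ',.;!?:"\'':
                token = token[1:]
            else:
                break
        return token

    print(rstrip_token('word,,,,,,'))  # 'word,'  (stops after removing 5 characters)
    print(lstrip_token('"hello'))      # 'hello'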
Comments from other users:

What is the purpose of the __lstrip and __rstrip functions? Is the goal of these functions to remove the first/last 5 symbols of a token? Because it seems like a weird use case.

How long is token? If these sequences are large, your current stripping methods could be quite inefficient (each slice creates a new object); but if they are tiny, that might not be a major issue. Or, how large is normalised and what type of collection is it? If it's a large list/tuple, repeated membership checks might be costly, but if it's small, or if it's a dict/set, then perhaps that's not the source of trouble. Currently, your question is too abstract.
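For example, if normalised is a large list of (word, index)-style tuples, the nested loop in __generate_results rescans it once for every original token; a sketch of grouping it by index in a single pass instead (the helper name and sample data here are illustrative, not from the original script):

    from collections import defaultdict

    def group_by_index(normalised):
        # One pass: map each index to the list of words that share it,
        # so later lookups are a dict access instead of a scan over the list.
        grouped = defaultdict(list)
        for item in normalised:
            grouped[item[1]].append(item[0])
        return grouped

    # Made-up data in the (word, idx) shape the question's code appears to use:
    sample = [('hello', 0), ('world', 1), ('again', 1)]
    print(group_by_index(sample)[1])  # ['world', 'again']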