6
\$\begingroup\$

I have some code that calculates the "sentiment" of a Tweet.

The task starts with an AFINN file, which is a tab-separated list of around 2500 key-value pairs. I read this into a dict using the following function:

    import csv

    csv.register_dialect('tsv', delimiter='\t')

    def processSentiments(file):
        with open(file) as sentiments:
            reader = csv.reader(sentiments, 'tsv')
            return dict((word, int(sentiment)) for word, sentiment in reader)
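For illustration, a call would look something like this (the AFINN-111.txt filename is just a placeholder for whatever the file is actually called):

    sentimentsDict = processSentiments('AFINN-111.txt')
    # maps each word to its integer score, e.g. {'good': 3, 'bad': -3, ...}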

I'm happy with this code - it's pretty clear and I'm not worried about its speed and memory usage as the file is small.

Using the created dict I now need to process many tweets. Let's say I have 1 GB of them in JSON format, from the live stream.

What I need to do is:

  1. read a tweet file, with a JSON tweet on each line
  2. parse each tweet to a dict using json.loads
  3. extract the text field from the tweet - giving the content of the tweet
  4. for each word in the content, check if it has a sentiment
  5. for each sentiment word in the tweet, look up its value (from the AFINN dict) and sum across the tweet
  6. store that number

So whilst I have a lot of tweet data, I only need a list of integers. This is obviously much smaller.

I want to be able to stream the data from the file and convert each line to its sentiment value in the most efficient way possible. Here's what I came up with:

    import json

    def extractSentiment(tweetText, sentimentsDict):
        return sum([sentimentsDict.get(word, 0) for word in tweetText.split()])

    def processTweet(tweet, sentimentsDict):
        try:
            tweetJson = json.loads(tweet)
        except ValueError, e:
            raise Exception('Invalid tweet: ', tweet, e)
        tweetText = tweetJson.get('text', False)
        if tweetText:
            return extractSentiment(tweetText, sentimentsDict)
        else:
            return 0

    def processTweets(file, sentimentsDict):
        with open(file) as tweetFile:
            return [processTweet(tweet, sentimentsDict) for tweet in tweetFile]

So I am using a list comprehension over the tweet file to map each line to its corresponding number.

So my question is: how would you rewrite the above to maximise speed and minimise memory usage?

Any other comments on the Python would be welcome too.

I realise that I could use a single-producer -> multiple-consumer pattern to speed up the processing of the file, but for now I would like to stick with a single thread.

\$\endgroup\$
2
  • 2
    \$\begingroup\$Planning a tender to the Secret Service? ;-)\$\endgroup\$
    – rolfl
    Commented Jul 6, 2014 at 13:48
  • \$\begingroup\$@rolfl Software to "detect false positives". Interesting... Although, what about false false positives...\$\endgroup\$ Commented Jul 6, 2014 at 18:27

2 Answers

6
\$\begingroup\$

As you may have suspected, the list comprehension in processTweets makes this non-streaming and eats a lot of memory, as it has to hold the entire result list in memory before returning to the caller. That might be fine, as it's just a list of integers.

You can make this streaming by turning this method into a generator:

    def processTweets(path, sentimentsDict):
        with open(path) as fh:
            for tweet in fh:
                yield processTweet(tweet, sentimentsDict)
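Callers can then consume the stream lazily. A small usage sketch (filenames hypothetical):

    afinn = processSentiments('AFINN-111.txt')

    # materialise the (small) list of integers if you need several passes...
    scores = list(processTweets('tweets.json', afinn))

    # ...or fold over the stream without building any list at all:
    total = sum(processTweets('tweets.json', afinn))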

Note that I renamed your variables (file -> path, tweetFile -> fh), because file shadows a built-in Python name (and is usually syntax-highlighted as such), and because I like to use fh for throwaway file handles ;-)


You don't need the default False value in tweetJson.get('text', False); this will work just fine, because .get will return None, which is falsy:

    tweetText = tweetJson.get('text')
    if tweetText:
        return extractSentiment(tweetText, sentimentsDict)
    else:
        return 0
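If you prefer, that collapses into a single conditional expression; purely a style choice:

    tweetText = tweetJson.get('text')
    return extractSentiment(tweetText, sentimentsDict) if tweetText else 0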

In processTweet, you catch ValueError and raise an Exception. Since you are not doing anything special to handle the ValueError, you could just let it bubble up, which might also speed the process up a bit, since Python doesn't need to set up a try/except for every single tweet.
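As a sketch, processTweet would then shrink to something like this, with a malformed line raising ValueError straight to the caller:

    def processTweet(tweet, sentimentsDict):
        # json.loads raises ValueError on malformed input; let it propagate
        tweetText = json.loads(tweet).get('text')
        if tweetText:
            return extractSentiment(tweetText, sentimentsDict)
        return 0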

\$\endgroup\$
1
  • \$\begingroup\$On your first point - I think that's fine. I need to process the integers multiple times, so I suppose I need to put them into a list anyway. My worry was that the list comprehension would load the entire file into memory, or does it stream it? Thanks for the other tips! I keep forgetting about the falsiness of all things Python...\$\endgroup\$ Commented Jul 6, 2014 at 9:45
3
\$\begingroup\$

Here are two notes beyond the 'maximise speed and minimise memory usage' question:

Note that the present implementation has a UTF-8 encoding problem: in Python 2, csv.reader yields byte strings, so a word such as 'naïve' is stored as raw UTF-8 bytes and will never match the unicode text that json.loads produces:

    reader = csv.reader(sentiments, 'tsv')
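One possible fix, sketched under the assumption of Python 2 (where csv.reader yields byte strings), is to decode each field explicitly:

    import csv

    csv.register_dialect('tsv', delimiter='\t')

    def processSentiments(path):
        with open(path, 'rb') as sentiments:
            reader = csv.reader(sentiments, 'tsv')
            # decode the raw bytes so u'naïve' matches the unicode
            # strings that json.loads produces for the tweet text
            return dict((word.decode('utf-8'), int(sentiment))
                        for word, sentiment in reader)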

Also note the problem with the word tokenization used here:

    return sum([sentimentsDict.get(word, 0) for word in tweetText.split()])

The ordinary string split method is usually not good enough for natural language; for example, a comma stuck to a word ('great,') will never match the AFINN dictionary. You'll probably want to look into re.split or a proper tokenizer, e.g. from NLTK.
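A minimal sketch using the standard re module (assuming the dictionary keys are lowercase, as they are in the standard AFINN-111 list):

    import re

    WORD_RE = re.compile(r"[\w']+", re.UNICODE)

    def extractSentiment(tweetText, sentimentsDict):
        # findall drops surrounding punctuation, so "great," and
        # "great" both resolve to the same dictionary key
        return sum(sentimentsDict.get(word, 0)
                   for word in WORD_RE.findall(tweetText.lower()))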

\$\endgroup\$
