I have some code that calculates the "sentiment" of a Tweet.
The task starts with an AFINN file, which is a tab-separated list of around 2500 key-value pairs. I read this into a dict using the following function:
import csv

csv.register_dialect('tsv', delimiter='\t')

def processSentiments(file):
    # Build a word -> integer score dict from the tab-separated AFINN file.
    with open(file) as sentiments:
        reader = csv.reader(sentiments, 'tsv')
        return dict((word, int(sentiment)) for word, sentiment in reader)
I'm happy with this code - it's pretty clear and I'm not worried about its speed and memory usage as the file is small.
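For reference, the resulting dict simply maps each word to its integer score; with made-up scores and a placeholder file path, it gets used like this:

afinn = processSentiments('afinn.tsv')  # placeholder path, not the real filename
# afinn is now something like {'happy': 3, 'tired': -2, ...} - scores shown are illustrative only
afinn.get('happy', 0)   # score for a word with an entry
afinn.get('zzzzz', 0)   # 0 for a word with no sentiment entry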
Using the created dict, I now need to process many tweets. Let's say I have 1GB of them in JSON format, from the live stream.
What I need to do is:
- read a tweet file, with a JSON tweet on each line
- parse each tweet to a dict using json.loads
- extract the text field from the tweet, giving the content of the tweet
- for each word in the content, check if it has a sentiment
- for each sentiment word in the tweet, calculate its value (from the AFINN dict) and sum across the tweet
- store that number
So whilst I have a lot of tweet data, I only need a list of integers. This is obviously much smaller.
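As a concrete (made-up) example, each tweet reduces to a single integer roughly like this:

# Hypothetical scores, not the real AFINN values.
sentimentsDict = {'happy': 3, 'tired': -2}
text = 'so happy but so tired'
# Words without an entry contribute 0, so this tweet scores 3 + (-2) = 1.
score = sum(sentimentsDict.get(word, 0) for word in text.split())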
I want to be able to stream the data from the file and convert each line to its sentiment value in the most efficient way possible. Here's what I came up with:
import json

def extractSentiment(tweetText, sentimentsDict):
    # Sum the scores of the words that have an AFINN entry; unknown words count as 0.
    return sum([sentimentsDict.get(word, 0) for word in tweetText.split()])

def processTweet(tweet, sentimentsDict):
    try:
        tweetJson = json.loads(tweet)
    except ValueError as e:
        raise Exception('Invalid tweet: ', tweet, e)
    tweetText = tweetJson.get('text', False)
    if tweetText:
        return extractSentiment(tweetText, sentimentsDict)
    else:
        return 0

def processTweets(file, sentimentsDict):
    with open(file) as tweetFile:
        return [processTweet(tweet, sentimentsDict) for tweet in tweetFile]
So I am using a list comprehension over the tweet file to map each line in the file to its corresponding number.
So my question is: how would you rewrite the above to maximise speed and minimise memory usage?
Any other comments on the Python would be welcome too.
I realise that I could use a single-producer -> multiple-consumer pattern to speed up the processing of the file, but for now I would like to stick with a single thread.
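For context, the direction I have been considering is making processTweets lazy, so the file is only ever read one line at a time and the scores can be consumed incrementally. This is just a rough, untested sketch (processTweetsLazily is my own name for it):

def processTweetsLazily(file, sentimentsDict):
    # Generator variant: yields one sentiment score per line instead of
    # building the whole list in memory at once.
    with open(file) as tweetFile:
        for tweet in tweetFile:
            yield processTweet(tweet, sentimentsDict)

# The scores can then be summed or written out as they are produced, e.g.:
# total = sum(processTweetsLazily('tweets.json', sentimentsDict))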