6
\$\begingroup\$

I have some code that calculates the "sentiment" of a Tweet.

The task starts with an AFINN file, which is a tab-separated list of around 2500 key-value pairs. I read this into a dict using the following function:

    import csv

    csv.register_dialect('tsv', delimiter='\t')

    def processSentiments(file):
        with open(file) as sentiments:
            reader = csv.reader(sentiments, 'tsv')
            return dict((word, int(sentiment)) for word, sentiment in reader)
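For illustration, a call would look something like this (the AFINN-111.txt filename is just a placeholder for whatever the file is actually called):

    sentimentsDict = processSentiments('AFINN-111.txt')
    # maps each word to its integer score, e.g. {'good': 3, 'bad': -3, ...}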

I'm happy with this code - it's pretty clear and I'm not worried about its speed and memory usage as the file is small.

Using the created dict I now need to process many tweets. Let's say I have 1 GB of them in JSON format, from the live stream.

What I need to do is:

  1. read a tweet file, with a JSON tweet on each line
  2. parse each tweet to a dict using json.loads
  3. extract the text field from the tweet - giving the content of the tweet
  4. for each word in the content, check if it has a sentiment
  5. for each sentiment word in the tweet, look up its value (from the AFINN dict) and sum across the tweet
  6. store that number

So whilst I have a lot of tweet data, I only need a list of integers. This is obviously much smaller.

I want to be able to stream the data from the file and convert each line to its sentiment value in the most efficient way possible. Here's what I came up with:

    import json

    def extractSentiment(tweetText, sentimentsDict):
        return sum([sentimentsDict.get(word, 0) for word in tweetText.split()])

    def processTweet(tweet, sentimentsDict):
        try:
            tweetJson = json.loads(tweet)
        except ValueError, e:
            raise Exception('Invalid tweet: ', tweet, e)
        tweetText = tweetJson.get('text', False)
        if tweetText:
            return extractSentiment(tweetText, sentimentsDict)
        else:
            return 0

    def processTweets(file, sentimentsDict):
        with open(file) as tweetFile:
            return [processTweet(tweet, sentimentsDict) for tweet in tweetFile]

So I am using a list comprehension over the tweet file to map each line to its corresponding number.

So my question is: how would you rewrite the above to maximise speed and minimise memory usage?

Any other comments on the Python would be welcome too.

I realise that I could use a single-producer -> multiple-consumer pattern to speed up the processing of the file, but for now I would like to stick with a single thread.

\$\endgroup\$
2
  • 2
    \$\begingroup\$Planning a tender to the Secret Service? ;-)\$\endgroup\$
    – rolfl
    Commented Jul 6, 2014 at 13:48
  • \$\begingroup\$@rolfl Software to "detect false positives". Interesting... Although, what about false false positives...\$\endgroup\$ Commented Jul 6, 2014 at 18:27

2 Answers

6
\$\begingroup\$

As you may have suspected, the list comprehension in processTweets makes this non-streaming and eats a lot of memory, as it has to hold the entire result list in memory before returning to the caller. That might be fine, as it's just a list of integers.

You can make this streaming by turning this method into a generator:

    def processTweets(path, sentimentsDict):
        with open(path) as fh:
            for tweet in fh:
                yield processTweet(tweet, sentimentsDict)
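Callers can then consume the stream lazily. A small usage sketch (filenames hypothetical):

    afinn = processSentiments('AFINN-111.txt')

    # materialise the (small) list of integers if you need several passes...
    scores = list(processTweets('tweets.json', afinn))

    # ...or fold over the stream without building any list at all:
    total = sum(processTweets('tweets.json', afinn))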

Note that I renamed your variables (file -> path, tweetFile -> fh), because file shadows a built-in Python name (and is usually syntax-highlighted as such), and because I like to use fh for throwaway file handles ;-)


You don't need the default False value in tweetJson.get('text', False); this will work just fine, because .get will return None, which is falsy:

    tweetText = tweetJson.get('text')
    if tweetText:
        return extractSentiment(tweetText, sentimentsDict)
    else:
        return 0
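If you prefer, that collapses into a single conditional expression; purely a style choice:

    tweetText = tweetJson.get('text')
    return extractSentiment(tweetText, sentimentsDict) if tweetText else 0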

In processTweet, you catch ValueError and raise an Exception. Since you are not doing anything special to handle the ValueError, you could just let it bubble up, which might also speed the process up a bit, since Python doesn't need to set up a try/except for every single tweet.
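As a sketch, processTweet would then shrink to something like this, with a malformed line raising ValueError straight to the caller:

    def processTweet(tweet, sentimentsDict):
        # json.loads raises ValueError on malformed input; let it propagate
        tweetText = json.loads(tweet).get('text')
        if tweetText:
            return extractSentiment(tweetText, sentimentsDict)
        return 0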

\$\endgroup\$
1
  • \$\begingroup\$On your first point - I think that's fine. I need to process the integers multiple times, so I suppose I need to put them into a list anyway. My worry was that the list comprehension would load the entire file into memory, or does it stream it? Thanks for the other tips! I keep forgetting about the falsiness of all things Python...\$\endgroup\$ Commented Jul 6, 2014 at 9:45
3
\$\begingroup\$

Here are two notes beyond the 'maximise speed and minimise memory usage' question:

Note that the present implementation has a UTF-8 encoding problem: in Python 2, csv.reader yields byte strings, so a word such as 'naïve' is stored as raw UTF-8 bytes and will never match the unicode text that json.loads produces:

    reader = csv.reader(sentiments, 'tsv')
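One possible fix, sketched under the assumption of Python 2 (where csv.reader yields byte strings), is to decode each field explicitly:

    import csv

    csv.register_dialect('tsv', delimiter='\t')

    def processSentiments(path):
        with open(path, 'rb') as sentiments:
            reader = csv.reader(sentiments, 'tsv')
            # decode the raw bytes so u'naïve' matches the unicode
            # strings that json.loads produces for the tweet text
            return dict((word.decode('utf-8'), int(sentiment))
                        for word, sentiment in reader)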

Also note the problem with the word tokenization used here:

    return sum([sentimentsDict.get(word, 0) for word in tweetText.split()])

The ordinary string split method is usually not good enough for natural language; for example, a comma stuck to a word ('great,') will never match the AFINN dictionary. You'll probably want to look into re.split or a proper tokenizer, e.g. from NLTK.
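A minimal sketch using the standard re module (assuming the dictionary keys are lowercase, as they are in the standard AFINN-111 list):

    import re

    WORD_RE = re.compile(r"[\w']+", re.UNICODE)

    def extractSentiment(tweetText, sentimentsDict):
        # findall drops surrounding punctuation, so "great," and
        # "great" both resolve to the same dictionary key
        return sum(sentimentsDict.get(word, 0)
                   for word in WORD_RE.findall(tweetText.lower()))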

\$\endgroup\$
