Python script for parsing and using CSV data

Question

I am given a CSV file of stations, with data like this:

station_id,date,temperature_c 68,2000.375,10.500 68,2000.542,5.400 68,2000.958,23.000 68,2001.125,20.400 68,2001.292,13.300 68,2001.375,10.400 68,2001.958,21.800 68,2002.208,15.500

and so on for many different station_ids.

Then I want to create a Python program that (1) gives the minimum reading (the third column) and (2) the station with the maximum "travel distance" with its readings. Thus if a station has 3 readings of -5,0,8 then that would mean a travel distance of 13. This can take an optional date range. Here is what I did.

#!/usr/bin/python from collections import defaultdict import csv import random import sys # In order to track each station's statistics, we'll create a Station class # to hold the data on a per-station basis. class Station: def __init__(self): self.readings = [] self.minimum = 99999999.0 self.travel = 0 # travel holds the change in temperature reading-by-reading def get_travel(self): return self.travel def set_travel(self, n): self.travel += abs(n) # getter & setter for station minimums def get_minimum(self): return self.minimum def set_minimum(self, n): self.minimum = n # infrastructure for future code expansion def get_readings(self): return self.readings def set_readings(self, date, temp): self.readings.append({ "date" : date, "temp" : temp}) """ Reporter class handles a list of Stations """ class Reporter: def __init__(self): # stations dict with entries for holding the specified stats. self.stations = defaultdict(Station) self.global_minimum = { "station" : "default", "date" : 1, "temp" : 9999999 } self.longest_travel = { "station" : "default", "range" : 0 } """ Determines which station recorded the coldest temperature args: CSV file returns: dict with data """ def minimum_temperature(self, filename): with open(filename, 'r') as datafile: try: csv_reader = csv.reader(datafile) next(csv_reader) # reading line-by-line since CSV could be a big file for row in csv_reader: station, date, temp = row # save the station's readings self.stations[station].set_readings(date, temp) temp = float(temp) if (temp < self.stations[station].get_minimum()): self.stations[station].set_minimum(temp) if(temp < self.global_minimum["temp"]): self.global_minimum = { "station" : station, "temp" : temp, "date" : date } # The specs state that in the event that a tie occurs simply return # one pair at random. if (temp == self.global_minimum["temp"]): if (random.randint(1,100) % 2 == 0): self.global_minimum = { "station" : station, "date" : date, "temp" : temp } except csv.Error as e: sys.exit('file {}, line {}: {}'.format(filename, reader.line_num, e)) return self.global_minimum """ Determines which station "traveled" the most args: CSV file, begin date (optional), end date (optional) returns: dict with data """ def max_travel(self,filename,begin=1.0,end=9999.9): with open(filename, 'r') as datafile: try: csv_reader = csv.reader(datafile) next(csv_reader) # reading line-by-line since CSV could be a big file for row in csv_reader: station, date, temp = row # save for future expansion self.stations[station].set_readings(date, temp) date = float(date) if date > begin and date < end: temp = float(temp) self.stations[station].set_travel(temp) travel = self.stations[station].get_travel() if ( travel > self.longest_travel["range"]): self.longest_travel = { "station" : station, "range" : travel } except csv.Error as e: sys.exit('file {}, line {}: {}'.format(filename, reader.line_num, e)) return self.longest_travel if __name__ == "__main__": csv_file = sys.argv[1] # fetch lowest temperature reporter = Reporter() global_minimum = reporter.minimum_temperature(csv_file) print("station {} had the global minimum on {}".format(global_minimum["station"], global_minimum["date"])) # fetch maximum travel overall longest_travel = reporter.max_travel(csv_file) print("station {} had the greatest travel at {}".format(longest_travel["station"], longest_travel["range"])) # now try a date range reporter2 = Reporter() begin = 2001.0 end = 2006.0 longest_travel = reporter2.max_travel(csv_file,begin,end) print("for {} to {}, station {} had the greatest travel at {}".format(begin, end, longest_travel["station"], longest_travel["range"]))

I'm particularly interested in speeding it up and memory usage but also how to Pythonically deal with with getters/setters.

C.Nivs · Accepted Answer · 2021-06-30 17:57:19Z

Getters and Setters

If you are to use this pattern, the @property decorator will accomplish this in a more pythonic way:

class Station: def __init__(self): self.readings = [] self._minimum = 99999999.0 self._travel = 0 @property def minimum(self, value): return self._minimum @minimum.setter def minimum(self, value): self._minimum = value @property def travel(self, value): return self._travel @travel.setter def travel(self, value): self._travel = abs(value) def add_reading(self, date, temp): self.readings.append({ "date" : date, "temp" : temp}) # access the elements this way station = Station() print(station.travel) 0 station.travel += 1 print(station.travel) 1

However, the setter pattern doesn't really make sense for your readings, because you aren't really setting, you're appending. So keep a function that appends a reading to the readings list, it's more explicit this way. I wouldn't expect:

station.readings = ('date', 'reading')

to do anything other than set the attribute, but it actually has a surprising side-effect.

Spacing

Your indentation should be four spaces, not two.

ShapeOfMatter · Accepted Answer · 2021-07-01 11:38:21Z

Probably a typo in your except clauses; I suspect it's supposed to say csv_reader.line_num.
What is going on with the "travel"? What you've actually written is "sum of absolute values", but I assume what you actually want is either "max minus min" or "sum of day-to-day deltas".
My opinion is that getters/setters/properties are inherently un-pythonic. More importantly, they're opaque. They look like attributes but they're functions that can have side-effects. Don't hide the possibility of side-effects from the audience; do just use an actual attribute wwh
I like the DictReader provided by the csv module. It's not everything we might ask for, but it'll save us accidentally handling malformed data.
You're doing a lot of work to handle the file in a stream, but you're also reading the file multiple times and you're keeping all the data in memory once it's read.
- This could get buggy. For example, in your main, you re-use reporter for the second pass, which means it will contain every datapoint in duplicate.
- You should pick one: Either get everything you want in a single pass, or read all the data into an appropriate structure and run queries on that. Since you specify speed and memory usage as concerns, I'll assume you want the first choice (but I suspect that's the wrong decision).
You probably don't need a Reporter class at all. It's not helping you manage any state (that you want to keep) across functions, so just declare your functions in the top namespace.

Stack Exchange Network

Python script for parsing and using CSV data

2 Answers 2

Getters and Setters

Spacing

Hot Network Questions

Python script for parsing and using CSV data

2 Answers 2

Getters and Setters

Spacing

Related

Hot Network Questions