Working from a CSV of about 14,000 lines, I need to match dates in the CSV against some API. I'm reading the CSV using Pandas (for various other reasons). Most often, the date is just a year, i.e. an integer, which Pandas converts to a float. Other date formats also occur (see main() in the code below). Dates are day-first: 12-10-1887 is the 12th of October, not the 10th of December.
The API also returns date strings, often also incomplete, for example when only year but not month and/or day of birth are known. Single-digit days and months do get a leading zero, and format is as close to ISO-date as possible, e.g. March 1675 will be "1675-03".
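To make the target format concrete, here is a minimal sketch of how I understand it. The helper name iso_partial is made up for illustration and is not something the API provides; it only shows what the strings look like when month and/or day are missing.

```python
def iso_partial(year, month=None, day=None):
    """Format a possibly incomplete date the way the API appears to (assumption)."""
    parts = ['{:04d}'.format(year)]
    if month is not None:
        parts.append('{:02d}'.format(month))
        # A day only makes sense when the month is known as well.
        if day is not None:
            parts.append('{:02d}'.format(day))
    return '-'.join(parts)


print(iso_partial(1675, 3))       # 1675-03
print(iso_partial(1887, 10, 12))  # 1887-10-12
print(iso_partial(1956))          # 1956
```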
I think my best bet is not to convert to type datetime, but to try to match strings.
EDIT: because this way I can let the API do more work.
http://baseURL/name:Gupta,Anil&date:1956
Without the date, I get all Anil Guptas. With it, I get all Anil Guptas with either year of birth (most likely), year of death, or year of activity 1956. All in all, this is MUCH less work than writing date converters for fields that are empty in 90% of the cases. In general, name + exact date of birth is a pretty unique identifier.
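As a rough sketch of how such a query string could be put together: build_query is a hypothetical helper, BASE_URL is the same placeholder as in the example above, and I am assuming that no URL-encoding of the name is needed.

```python
BASE_URL = 'http://baseURL'  # placeholder, not a real endpoint


def build_query(name, date=None):
    """Build a query URL; the date part is only added when a date is known."""
    query = '{}/name:{}'.format(BASE_URL, name)
    if date:
        query += '&date:{}'.format(date)
    return query


print(build_query('Gupta,Anil', '1956'))
print(build_query('Gupta,Anil'))
```

The second call leaves out the date, which corresponds to the "all Anil Guptas" case.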
So I wrote this dateclean.py (Python 3.7):
import re
from datetime import datetime as dt


def cleanup(date):
    patterns = [# 0) 1-12-1963
                r'(\d{1,2})-(\d{1,2})-(\d{4})$',
                # 1) 1789-7-14
                r'(\d{4})-(\d{1,2})-(\d{1,2})$',
                # 2) '1945-2'
                r'(\d{4})-(\d{1,2})$',
                # 3) 2-1883
                r'(\d{1,2})-(\d{4})$'
                ]

    try:
        return str(int(date))
    except ValueError:
        pass

    for pat in patterns:
        q = re.match(pat, date)
        if q:
            if pat == patterns[0]:
                year = re.sub(patterns[0], r'\3', date)
                month = re.sub(patterns[0], r'\2', date)
                day = re.sub(patterns[0], r'\1', date)
                return '{0}-{1:0>2}-{2:0>2}'.format(year, month, day)

            if pat == patterns[1]:
                year = re.sub(patterns[1], r'\1', date)
                month = re.sub(patterns[1], r'\2', date)
                day = re.sub(patterns[1], r'\3', date)
                return '{0}-{1:0>2}-{2:0>2}'.format(year, month, day)

            if pat == patterns[2]:
                year = re.sub(patterns[2], r'\1', date)
                month = re.sub(patterns[2], r'\2', date)
                return '{0}-{1:0>2}'.format(year, month)

            if pat == patterns[3]:
                year = re.sub(patterns[3], r'\2', date)
                month = re.sub(patterns[3], r'\1', date)
                return '{0}-{1:0>2}'.format(year, month)
    else:
        return date


def main():
    dates = 1858.0, '1-12-1963', '1945-2', '7-2018', '1789-7-14',
    for date in dates:
        print('in: {} out: {}'.format(date, cleanup(date)))


if __name__ == "__main__":
    main()
The list dates in main() contains all formats known until now. On the plus side, by using this method, adding new formats is easy.
There are a lot of patterns[x], which is not at all DRY, yet I don't see how I can avoid that: I have to split each input string into year, month and day, if any (using r'\1' etc.).
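One direction that might reduce the repetition, offered only as a sketch and not as a finished solution, is to let each pattern name its own parts with named groups, so that a single formatting step serves all patterns. The names PATTERNS and cleanup_named are made up for this illustration; the patterns themselves are the four from the code above.

```python
import re

# Each pattern names the parts it captures, so the code below never
# needs to know which group number holds the year, month or day.
PATTERNS = [
    re.compile(r'(?P<day>\d{1,2})-(?P<month>\d{1,2})-(?P<year>\d{4})$'),  # 1-12-1963
    re.compile(r'(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})$'),  # 1789-7-14
    re.compile(r'(?P<year>\d{4})-(?P<month>\d{1,2})$'),                   # 1945-2
    re.compile(r'(?P<month>\d{1,2})-(?P<year>\d{4})$'),                   # 2-1883
]


def cleanup_named(date):
    # Bare years arrive as floats (or digit strings) and need no pattern.
    try:
        return str(int(date))
    except ValueError:
        pass

    for pattern in PATTERNS:
        match = pattern.match(date)
        if match:
            parts = match.groupdict()
            result = parts['year']
            # Append month and day only if this pattern captured them.
            for key in ('month', 'day'):
                if key in parts:
                    result += '-' + parts[key].zfill(2)
            return result

    # Unknown formats are passed through unchanged.
    return date


for date in (1858.0, '1-12-1963', '1945-2', '7-2018', '1789-7-14'):
    print('in: {} out: {}'.format(date, cleanup_named(date)))
```

With this shape, adding a new format would mean adding one line to PATTERNS, and the nesting inside the loop stays at one level.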
Also, I know flat is better than nested, but I don't see how I can improve on that. I do try to return as early as possible.
Readability is more important to me than performance, because the API is a bottleneck anyway. Then again, serious slowdowns should be avoided.