13
\$\begingroup\$

Working from a CSV of about 14,000 lines, I need to match dates in the CSV against an API. I'm reading the CSV with Pandas (for various other reasons). Most often, the date is just a year, i.e. an integer, which Pandas converts to a float. Other date formats also occur (see main in the code below). 12-10-1887 is the 12th of October, not the 10th of December.

The API also returns date strings, often also incomplete, for example when only the year but not the month and/or day of birth is known. Single-digit days and months do get a leading zero, and the format is as close to an ISO date as possible, e.g. March 1675 becomes "1675-03".

I think my best bet is not to convert to type datetime, but to try to match strings.

EDIT: I chose string matching because this way I can let the API do more work.

http://baseURL/name:Gupta,Anil&date:1956 

Without the date, I get all Anil Guptas. With it, I get all Anil Guptas with 1956 as either year of birth (most likely), year of death, or year of activity. All in all, this is MUCH less work than writing date converters for fields that are empty in 90% of the cases. In general, name + exact date of birth is a pretty unique identifier.
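To make that concrete, here is a minimal sketch of how a cleaned-up date would be slotted into such a request. The URL scheme is just the example above, and the helper name is hypothetical:

def build_query(base_url, name, date):
    # 'date' is assumed to be already normalised, e.g. 1956.0 -> '1956'
    return f'{base_url}/name:{name}&date:{date}'

build_query('http://baseURL', 'Gupta,Anil', '1956')
# -> 'http://baseURL/name:Gupta,Anil&date:1956'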

So I wrote this dateclean.py (Python 3.7):

import re
from datetime import datetime as dt


def cleanup(date):
    patterns = [
        # 0) 1-12-1963
        r'(\d{1,2})-(\d{1,2})-(\d{4})$',
        # 1) 1789-7-14
        r'(\d{4})-(\d{1,2})-(\d{1,2})$',
        # 2) '1945-2'
        r'(\d{4})-(\d{1,2})$',
        # 3) 2-1883
        r'(\d{1,2})-(\d{4})$'
    ]

    try:
        return str(int(date))
    except ValueError:
        pass

    for pat in patterns:
        q = re.match(pat, date)
        if q:
            if pat == patterns[0]:
                year = re.sub(patterns[0], r'\3', date)
                month = re.sub(patterns[0], r'\2', date)
                day = re.sub(patterns[0], r'\1', date)
                return '{0}-{1:0>2}-{2:0>2}'.format(year, month, day)
            if pat == patterns[1]:
                year = re.sub(patterns[1], r'\1', date)
                month = re.sub(patterns[1], r'\2', date)
                day = re.sub(patterns[1], r'\3', date)
                return '{0}-{1:0>2}-{2:0>2}'.format(year, month, day)
            if pat == patterns[2]:
                year = re.sub(patterns[2], r'\1', date)
                month = re.sub(patterns[2], r'\2', date)
                return '{0}-{1:0>2}'.format(year, month)
            if pat == patterns[3]:
                year = re.sub(patterns[3], r'\2', date)
                month = re.sub(patterns[3], r'\1', date)
                return '{0}-{1:0>2}'.format(year, month)
    else:
        return date


def main():
    dates = 1858.0, '1-12-1963', '1945-2', '7-2018', '1789-7-14',
    for date in dates:
        print('in: {} out: {}'.format(date, cleanup(date)))


if __name__ == "__main__":
    main()

The list dates in main() contains all formats known so far. On the plus side, this method makes adding new formats easy.
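Not in the script yet, but a few asserts along these lines would make it safer to add formats later; the expected values follow from cleanup() above:

def test_cleanup():
    assert cleanup(1858.0) == '1858'
    assert cleanup('1-12-1963') == '1963-12-01'
    assert cleanup('1945-2') == '1945-02'
    assert cleanup('7-2018') == '2018-07'
    assert cleanup('1789-7-14') == '1789-07-14'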

There are a lot of patterns[x] references, which is not at all DRY, yet I don't see how to avoid that: I have to split each input string into year, month and day, if any (using r'\1' etc.).

Also, I know flat is better than nested, but I don't see how I can improve on that. I do try to return as early as possible.

Readability is more important to me than performance, because the API is a bottleneck anyway. Then again, serious slowdowns should be avoided.

\$\endgroup\$

    5 Answers

    12
    \$\begingroup\$

    For starters, congratulations: the code is clean and uses rather good constructs. I just have two nitpicks about the layout:

    • I prefer to indent lists with a line break before the first element:

      patterns = [
          # 0) 1-12-1963
          r'(\d{1,2})-(\d{1,2})-(\d{4})$',
          # 1) 1789-7-14
          r'(\d{4})-(\d{1,2})-(\d{1,2})$',
          # 2) '1945-2'
          r'(\d{4})-(\d{1,2})$',
          # 3) 2-1883
          r'(\d{1,2})-(\d{4})$',
      ]
    • PEP 8 recommends two blank lines before a top-level function definition.

    You can also simplify the boilerplate of a try: … except Something: pass by using the contextlib.suppress context manager instead:

    with suppress(ValueError):
        return str(int(date))

    Lastly, your pattern manipulation:

    • a match object can return all groups at once using its groups method, no need to parse them out individually;
    • you can unpack the groups tuple as arguments to format directly;
    • you can compile all your regexes beforehand instead of letting the regex engine do it on the fly;
    • you should associate the output format string to each pattern to remove the need to search back which pattern matched and how it should be processed.

    Proposed improvements:

    import re
    from contextlib import suppress


    PATTERNS = [
        # 0) 1-12-1963 => 1963-12-01
        (re.compile(r'(\d{1,2})-(\d{1,2})-(\d{4})$'), '{2}-{1:0>2}-{0:0>2}'),
        # 1) 1789-7-14 => 1789-07-14
        (re.compile(r'(\d{4})-(\d{1,2})-(\d{1,2})$'), '{0}-{1:0>2}-{2:0>2}'),
        # 2) 1945-2 => 1945-02
        (re.compile(r'(\d{4})-(\d{1,2})$'), '{0}-{1:0>2}'),
        # 3) 2-1883 => 1883-02
        (re.compile(r'(\d{1,2})-(\d{4})$'), '{1}-{0:0>2}'),
    ]


    def cleanup(date):
        with suppress(ValueError):
            return str(int(date))
        for pattern, formater in PATTERNS:
            match = pattern.match(date)
            if match is not None:
                return formater.format(*match.groups())
        return date


    def main():
        dates = 1858.0, '1-12-1963', '1945-2', '7-2018', '1789-7-14',
        for date in dates:
            print(f'in: {date}, out: {cleanup(date)}')


    if __name__ == '__main__':
        main()
    \$\endgroup\$
    6
    • \$\begingroup\$A big THANK YOU! I would approve all answers if I could, but this one I think is bestestest.\$\endgroup\$
      – RolfBly
      Commented Jul 24, 2018 at 9:59
    • 1
      \$\begingroup\$Just as with the other answer, compiling regexes is not guaranteed to be any faster than using plain regexes (discussion here).\$\endgroup\$
      – Daniel
      Commented Jul 24, 2018 at 10:25
    • \$\begingroup\$@Daniel Changed the wording accordingly. But this kind of behaviour should be considered an implementation detail and is not necessarily present in anything other than CPython, so compiling regexes is still a good habit to have.\$\endgroup\$ Commented Jul 24, 2018 at 10:31
    • \$\begingroup\$This function will return as soon as there's a valid match, so it will return the first valid match it finds. Dictionaries don't have reliable ordering. So I would recommend using a list of lists or list of tuples if one wishes to have assurances as to which pattern is given priority (note that PATTERNS is effectively being used as a list anyway; the dictionary structure of it is never used).\$\endgroup\$ Commented Jul 24, 2018 at 17:21
    • \$\begingroup\$@Acccumulation You're bloody right. I'm too used to Python 3.6 dictionaries being ordered by design.\$\endgroup\$ Commented Jul 24, 2018 at 17:28
    10
    \$\begingroup\$

    I would follow @Mathias' answer for the Pythonic comments, however...

    I think my best bet is not to convert to type datetime, but to try to match strings.

    I disagree: don't reinvent the wheel.

    Instead of manually cleaning these dates, there are some really good libraries that can do this for you. Check out the python-dateutil library.

    To handle the special case of 12-10-1887 being the 12th of October, not the 10th of December, you can use parser.parse(d, dayfirst=True), which will assume day-month order rather than month-day.

    As @Mathias mentioned, dateutil behaves like pandas here, so this would work with .to_datetime(dayfirst=True) too; see the docs.


    from dateutil import parser


    def cleanup(dates):
        for date in dates:
            try:
                yield parser.parse(date, dayfirst=True)
            except (ValueError, TypeError) as e:
                print("Exception {} on unhandled date {}".format(e, date))


    if __name__ == "__main__":
        dates = [1858.0, '1-12-1963', '1945-2', '7-2018', '1789-7-14']
        for date in cleanup(dates):
            print(date)
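    As a side note on the pandas route mentioned above, a one-line sketch using the 12-10-1887 example from the question (the scalar call and the shown Timestamp are my rendering, not part of the original answer):

    import pandas as pd

    # dayfirst=True makes pandas read 12-10-1887 as the 12th of October
    pd.to_datetime('12-10-1887', dayfirst=True)  # Timestamp('1887-10-12 00:00:00')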
    \$\endgroup\$
    6
    • \$\begingroup\$Variable in except must be called date, not d\$\endgroup\$ Commented Jul 24, 2018 at 8:43
    • 1
      \$\begingroup\$Also using f-strings makes your solution even better :)\$\endgroup\$ Commented Jul 24, 2018 at 8:44
    • \$\begingroup\$Except the dateutil parser (at least with its default configuration) suffers from the same defect as pandas.to_datetime (which could have been used by the OP in the first place if it worked): it will consider '12-10-1887' as the 10th of December, 1887.\$\endgroup\$ Commented Jul 24, 2018 at 8:49
    • \$\begingroup\$@MathiasEttinger Which can be avoided by dayfirst=True\$\endgroup\$ Commented Jul 24, 2018 at 9:05
    • \$\begingroup\$@Ludisposed I edited the question to clarify why I opted for strings. I will certainly look at dateutil, tho.\$\endgroup\$
      – RolfBly
      Commented Jul 24, 2018 at 9:35
    6
    \$\begingroup\$

    The patterns list is not doing you any good here: you end up doing tests like if pat == patterns[0] to see which regex matched, so you may as well have written re.match(…) four times instead of looping. (By the way, use re.fullmatch() to avoid writing $ in every regex.)
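    A quick illustration of the fullmatch difference (not from the original code):

    import re

    re.match(r'\d{4}', '1945-2')      # matches '1945' -- anchored at the start only
    re.match(r'\d{4}$', '1945-2')     # None -- needs the explicit '$'
    re.fullmatch(r'\d{4}', '1945-2')  # None -- anchored at both ends for free
    re.fullmatch(r'\d{4}', '1858')    # <re.Match object; span=(0, 4), match='1858'>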

    Another misstep that makes your regexes ineffective is that they are not self-documenting. You end up writing code elsewhere that decides which group is the year, month, and date. To remedy that, use named capture groups. Also, you are discarding information captured in q. To extract information from the match, use q.group(…); don't perform substitutions with backreferences.
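    For example, a small illustration reusing the dd-mm-yyyy case from the question:

    import re

    pattern = r'(?P<d>\d{1,2})-(?P<m>\d{1,2})-(?P<y>\d{4})'
    q = re.fullmatch(pattern, '12-10-1887')
    q.group('y')    # '1887'
    q.groupdict()   # {'d': '12', 'm': '10', 'y': '1887'}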

    I'm not fond of the str(int(date)) special case. Ideally, you should avoid calling cleanup() with a parameter that could be either a string or a number. If you're importing your CSV using pandas, then specify that you always want to treat that column as a string. (If you use cleanup in the converters or date_parser for pandas.read_csv(), then the input should always be a string.)
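    For instance, something along these lines; the file and column names are placeholders, not taken from the original post:

    import pandas as pd

    # dtype=str keeps '1858' as the string '1858' instead of the float 1858.0
    df = pd.read_csv('your_file.csv', dtype={'date': str})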

    Suggested solution

    I'm not sure how lax you want to be about junk input, but I've tried to preserve your original permissive behavior.

    Compared to the solution by @Mathias, this one more clearly expresses the intention to format the output as some variant of YYYY-MM-DD by any means possible.

    import re


    def cleanup(date):
        patterns = [
            r'(?P<y>\d{4})(?:\.\d*)?',
            r'(?P<y>\d{4})-(?P<m>\d{1,2})(?:-(?P<d>\d{1,2}))?',
            r'(?:(?P<d>\d{1,2})-)?(?P<m>\d{1,2})-(?P<y>\d{4})',
            r'(?P<junk>.*)'
        ]
        tests = (re.fullmatch(pattern, str(date)) for pattern in patterns)
        match = next(match.groupdict() for match in tests if match is not None)
        return (
            '{y}-{m:0>2}-{d:0>2}' if match.get('d') else
            '{y}-{m:0>2}' if match.get('m') else
            '{y}' if match.get('y') else
            '{junk}'
        ).format_map(match)
    \$\endgroup\$
    2
    • \$\begingroup\$I wonder if match = next(filter(None, tests)).groupdict() wouldn't be more readable.\$\endgroup\$ Commented Jul 24, 2018 at 14:56
    • \$\begingroup\$@200_success I didn't know about named capture groups, so thanks for pointing that out. Re junk input: it's about artists, many of whom lived before there was any consensus on calendars at all. Will have to deal with it.\$\endgroup\$
      – RolfBly
      Commented Jul 24, 2018 at 19:51
    3
    \$\begingroup\$

    You don't need this function at all, as date parsing capabilities are built into pandas. The pd.read_csv function has various parameters to automatically parse dates. Look into:

    • parse_dates to specify which columns to parse;
    • date_parser to specify a custom function for parsing dates (which could be your own, if there weren't better options);
    • dayfirst to specify to the underlying date parser that 'xx-xx-xxxx' is 'dd-mm-yyyy' instead of 'mm-dd-yyyy' (note that 'xxxx-xx-xx' will still be parsed as 'yyyy-mm-dd').

    So all you need to do is to read your data using:

    pd.read_csv('path_to_your_file', parse_dates=['name of the date column'], dayfirst=True) 
    \$\endgroup\$
    3
    • \$\begingroup\$I did try parse_dates. It works if a certain minimum number of cell values look like a date. It went OK on a slice of the CSV, about 230 lines. The full CSV has 14K lines, with about 80-90% of the dates just a year. I found that Pandas then interprets those years as floats, and the dates as dates. That's why I decided to write my own date parser. But please CMIIW.\$\endgroup\$
      – RolfBly
      Commented Jul 24, 2018 at 9:55
    • \$\begingroup\$@RolfBly You can explicitly specify to use dateutil.parser.parse as date_parser instead of the default pandas algorithm. But year-only dates will somehow be parsed as the 24th of July for the given year.\$\endgroup\$ Commented Jul 24, 2018 at 10:20
    • \$\begingroup\$Thanks again. At the moment, I think that's another reason for attempting to let the API do more work (i.e. stick to strings). What the API returns is date specs like "not before 1753-07-25; not after 1754" (wrapped in XML).\$\endgroup\$
      – RolfBly
      Commented Jul 24, 2018 at 10:26
    3
    \$\begingroup\$

    Mathias' answer is pretty good. I wrote one myself but he was faster :) Anyway, here is my solution. As mentioned above, precompiling the regex patterns for performance improvements is the way to go here. I would also include a pattern for the 1858.0 case, which improves overall readability, I think. Since Python 3.6, f-strings are available and make code so much easier to read - you should start using them.

    import re

    PATTERNS = [
        r'(\d+)\.\d*',                    # 1858.0
        r'(\d{1,2})-(\d{1,2})-(\d{4})$',  # 1-12-1963
        r'(\d{4})-(\d{1,2})-(\d{1,2})$',  # 1789-7-14
        r'(\d{4})-(\d{1,2})$',            # 1945-2
        r'(\d{1,2})-(\d{4})$'             # 2-1883
    ]
    PATTERNS = [re.compile(p) for p in PATTERNS]


    def cleanup(date):
        if type(date) not str:  # illegal syntax; see the update below
            date = str(date)
        for i, p in enumerate(PATTERNS):
            res = p.match(date)
            if res and i == 0:
                return f'{res[1]}'
            if res and i == 1:
                return f'{res[3]}-{res[2] :0>2}-{res[1] :0>2}'
            if res and i == 2:
                return f'{res[1]}-{res[2] :0>2}-{res[3] :0>2}'
            if res and i == 3:
                return f'{res[1]}-{res[2] :0>2}'
            if res and i == 4:
                return f'{res[2]}-{res[1] :0>2}'


    def main():
        dates = 1858.0, '1-12-1963', '1945-2', '7-2018', '1789-7-14',
        for date in dates:
            print(f'in: {date} out: {cleanup(date)}')


    if __name__ == "__main__":
        main()

    Update

    As Daniel rightfully pointed out, type(date) not str is illegal syntax. The right way to check whether date is not a str:

    if not isinstance(date, str):
        date = str(date)
    \$\endgroup\$
    3
    • \$\begingroup\$if type(date) not str: is illegal syntax. Furthermore, it's a common misconception that using compiled regexes is by definition faster than using vanilla regexes. See this StackOverflow question for more information.\$\endgroup\$
      – Daniel
      Commented Jul 24, 2018 at 9:59
    • \$\begingroup\$But precompiling will never have a negative performance impact, right? So worst case it brings no better performance but improves readability?\$\endgroup\$ Commented Jul 24, 2018 at 10:28
    • \$\begingroup\$In my experience (anecdotal, no absolute evidence) it has had a negative impact before. This is especially the case when one is dealing with a small number of iterations, as is the case here. I'd have to do profiling to confirm or deny, though.\$\endgroup\$
      – Daniel
      Commented Jul 24, 2018 at 12:54
