I have a column in a Pandas DataFrame containing birth dates stored as strings (object dtype):
    0    16MAR39
    1    21JAN56
    2    18NOV51
    3    05MAR64
    4    05JUN48
I want to convert them to datetime format for processing. I have used
    # Convert string to datetime type
    data['BIRTH'] = pd.to_datetime(data['BIRTH'])
but the result is ...
    0   2039-03-16
    1   2056-01-21
    2   2051-11-18
    3   2064-03-05
    4   2048-06-05
    Name: BIRTH, dtype: datetime64[ns]
Clearly the dates have the wrong century: the years start with "20" instead of "19".
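As far as I can tell, this is related to how two-digit years are expanded. The sketch below only shows the rule used by the standard library's `datetime.strptime` with `%y` (00-68 become 2000-2068, 69-99 become 1969-1999); whether my format-less `pd.to_datetime` call applies exactly the same rule internally is an assumption on my part. The explicit format string `"%d%b%y"` and the inputs `01JAN68` and `01JAN69` are hypothetical additions for illustration, and `%b` assumes English month abbreviations.

```python
from datetime import datetime

# Two values from my data plus two hypothetical values around the pivot year.
samples = ["16MAR39", "05MAR64", "01JAN68", "01JAN69"]

# %d = day, %b = abbreviated month name, %y = two-digit year.
years = [datetime.strptime(text, "%d%b%y").year for text in samples]

for text, year in zip(samples, years):
    print(text, "->", year)
```

So with this rule any birth year before 1969 written with two digits ends up in the 2000s, which matches what I am seeing.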
I handled this using ...
    data['BIRTH'] = np.where(
        data['BIRTH'].dt.year > 2000,
        data['BIRTH'] - pd.offsets.DateOffset(years=100),
        data['BIRTH'],
    )
Result
    0   1939-03-16
    1   1956-01-21
    2   1951-11-18
    3   1964-03-05
    4   1948-06-05
    Name: BIRTH, Length: 10302, dtype: datetime64[ns]
I am wondering:
- Is there a way to process the data that gets it right the first time?
- Is there a better way to fix the data after the incorrect conversion?
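For the first point, one idea I have considered is to write the century into the string before parsing, so that the parser never has to guess. This is only a sketch under a strong assumption: every birth year in the column is in the 1900s and every value has the fixed `DDMMMYY` layout. The sample frame below is rebuilt from the five rows shown above, and `with_century` is a name I made up for the intermediate column.

```python
import pandas as pd

# Sample frame rebuilt from the five rows shown in the question.
data = pd.DataFrame(
    {"BIRTH": ["16MAR39", "21JAN56", "18NOV51", "05MAR64", "05JUN48"]}
)

# Assumption: all years are 19xx and the layout is always DDMMMYY,
# so "19" can be inserted between the month and the two-digit year.
with_century = data["BIRTH"].str[:5] + "19" + data["BIRTH"].str[5:]

# %Y is a four-digit year, so no century has to be inferred.
data["BIRTH"] = pd.to_datetime(with_century, format="%d%b%Y")

print(data["BIRTH"])
```

This would obviously break for anyone born in 2000 or later, so I am not sure it is better than correcting the dates afterwards.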
I'm an amateur coder and, as far as I understand things, Pandas is optimised for processing efficiency, so I wanted to use the Pandas datetime functionality for that reason. But is it better to consider NumPy's or Pandas' datetime functionality here? I know this dataset is small, but I am trying to improve my skills so that I know what to consider when working on larger datasets.