Trying to normalize data from different sources is always a hassle. This is especially true when working with dates.
Recently I was working with 3 huge data sources where the dates for the customer birth dates were all formatted differently. My goal was to get all of the dates standardized to YYYY-MM-DD for insertion into a MySQL database.
One of the data sources considered it a security risk to include the birth year in the data (seriously??? paranoid much?) so the birth date field came is as d-mm. If you are not familiar with the abbreviation that means that the day did not have leading zeros. So March comes in as 3 and December is 12. Not good for standardization.
Another twist was that if there was no birth date in the system the field was defaulted to N/A. But wait there’s more! The field had a hard coded length that was filled in with spaces. Beautiful! Now this is not just bad data but “Poopy Data”!
So to get to the standard date format I need to:
- Remove any leading or trailing spaces
- Make sure there is a leading zero on months with 1 character
- Add 0000 as the year on the beginning of the date
- Replace any instance of “N/A” with 0000-00-00
Let’s work from the inside out:
- Remove the spaces: TRIM(`birth_date`)
- Replace “N/A”: REPLACE(`birth_date`, ‘N/A’, ’00-00′)
- Make sure there is a leading zero in the month: LPAD(`birth_date`, 5, ‘0’)
- Add 0000 as the year for all records: CONCAT(‘0000’, ‘-‘, `birth_date
Now let’s put it all together:
CONCAT(‘0000’, ‘-‘, LPAD(REPLACE(TRIM(`birth_date`),’N/A’,’00-00′), 5, ‘0’)) AS `birth_date`
No more poopy data!
Now to fix the other file that has the date formatted as DD/MM/YYYY