Poopy Data Birth Date

Trying to normalize data from different sources is always a hassle.  This is especially true when working with dates.

Recently I was working with 3 huge data sources where the dates for the customer birth dates were all formatted differently.  My goal was to get all of the dates standardized to YYYY-MM-DD for insertion into a MySQL database.

One of the data sources considered it a security risk to include the birth year in the data (seriously??? paranoid much?) so the birth date field came is as d-mm.  If you are not familiar with the abbreviation that means that the day did not have leading zeros.  So March comes in as 3 and December is 12.  Not good for standardization.

Another twist was that if there was no birth date in the system the field was defaulted to N/A.  But wait there’s more!  The field had a hard coded length that was filled in with spaces.  Beautiful!  Now this is not just bad data but “Poopy Data”!

So to get to the standard date format I need to:

  1. Remove any leading or trailing spaces
  2. Make sure there is a leading zero on months with 1 character
  3. Add 0000 as the year on the beginning of the date
  4. Replace any instance of “N/A” with 0000-00-00

Let’s work from the inside out:

  • Remove the spaces: TRIM(`birth_date`)
  • Replace “N/A”: REPLACE(`birth_date`, ‘N/A’, ’00-00′)
  • Make sure there is a leading zero in the month: LPAD(`birth_date`, 5, ‘0’)
  • Add 0000 as the year for all records: CONCAT(‘0000’, ‘-‘, `birth_date)

Now let’s put it all together:

CONCAT(‘0000’, ‘-‘, LPAD(REPLACE(TRIM(`birth_date`),’N/A’,’00-00′), 5, ‘0’)) AS `birth_date`

No more poopy data!

Now to fix the other file that has the date formatted as DD/MM/YYYY