I work for an organization that has lots of databases containing person information. The data quality is poor. One case was a surname I found like this (this is the worst-case scenario):
Mark "Dunno his surname, but it sounded like Lion 'RAR', Ha ha"
The same record had a date of birth of 01/09/1499.
That is the extreme case. Most data quality issues are due to pressing a wrong key on the keyboard, e.g. Snith instead of Smith (n is next to m on the keyboard).
I am looking for algorithms that can help me with some kind of "fuzzy matching" under these circumstances. We need to process several million records per day. I searched for "data matching" and found the following algorithms:
SQL SOUNDEX, SQL METAPHONE, and Levenshtein distance.
Also, is there such a thing as a possible match for dates of birth? For a surname, a possible match would be one where the Levenshtein similarity is 80%.
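To make the "80%" idea concrete, here is a rough Python sketch of what I have in mind. It assumes that similarity means 1 minus the edit distance divided by the longer string's length, which is only one of several possible normalizations. The function names and the date-of-birth rule (day and month swapped, or exactly one digit different) are hypothetical illustrations, not an established standard.

```python
from datetime import date


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def dob_possible_match(d1: date, d2: date) -> bool:
    """Hypothetical rule: equal, day/month swapped, or exactly one digit differs."""
    if d1 == d2:
        return True
    if d1.year == d2.year and d1.day == d2.month and d1.month == d2.day:
        return True
    # isoformat() is always zero-padded YYYY-MM-DD, so positions line up
    return sum(x != y for x, y in zip(d1.isoformat(), d2.isoformat())) == 1
```

With this normalization, "Snith" against "Smith" is one substitution over five characters, so it sits right at the 80% mark. A full pairwise comparison like this would presumably be too slow for millions of records per day without first narrowing down the candidate pairs.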
Therefore I have two questions:
What algorithms are available besides the three listed above?
What approaches are used to find possible matches between addresses?