2

I work for an organization that has lots of databases containing person information. The data quality is poor. One case was a surname I found like this (this is the worst-case scenario):

Mark "Dunno his surname, but it sounded like Lion 'RAR', Ha ha"

There is a date of birth of: 01/09/1499

This is a worst-case scenario. Most data quality issues are due to pressing a wrong key on the keyboard e.g. Snith instead of Smith (n is next me m on the keyboard).

I am looking for algorithms which can help me with some kind of "fuzzy matching" under these circumstances. Our requirements involve several millions of records per day. I have looked for "data matching" and discovered the following algorithms:

SQL SOUNDEX SQL METAPHONE Levenshtein Distance 

Also is there such thing as a possible match for dates of birth? A possible match for a surname were the Levenstein distance is 80%.

Therefore I have two questions:

  1. What algorithms are available except the three specified above?

  2. What approaches are used to match possible addresses

15
  • 2
    Hmm, that I know of, there is no equivalent for birth dates, though a person is far more likely to be off by a digit (and not a day), so a good starting point is something that checks for the existence of birth dates whose digits are off by one.
    – Neil
    CommentedDec 11, 2014 at 13:21
  • 1
    Asking for a "source of information" means asking for an off-site resource, which is off-topic according to the rules of this site. If you delete that part from your question make it more specificially, I see a chance for it not to be closed soon (for example: are you going to improve the data quality by some manual effort? Or are you just trying to live with that level of quality by using some kind of "fuzzy matching"?
    – Doc Brown
    CommentedDec 11, 2014 at 13:52
  • 1
    An obvious additional date check would be age - any age over 90 or under 10 might merit additional review. A birth year of 1499 seems unlikely if it's for a living person.CommentedDec 11, 2014 at 16:09
  • 1
    @w0051977: String dates could be difficult. You'll have to decide how you want to handle strings like 01/02/1980 vs. 02/01/1980: Is it Feb. 01 1980 or Jan. 02 1980?CommentedDec 11, 2014 at 16:58
  • 1
    I can still see no real effort to clarify the question - so voting to close as "unclear what you are asking". The thing which is missing is your use case. It is not even clear if you want to validate newly added data (and maybe reject it when it does not look plausible), change/improve the existing data, or just do some searching within the data without changing it.
    – Doc Brown
    CommentedDec 12, 2014 at 9:08

1 Answer 1

2

For date of birth and people in general, one will need to validate if that person exists and if thier address is valid.

You can do both of these using 3rd party services that will:

  • Validate if the person is a real person
  • Validate and standardize address information

You can do this yourself using algorithms that you have mentioned, but why re-invent the wheel and most likely your implementation will be flawed.

Most of these services are configurable and allow you to send information like name, DOB, phone number and will attempt to look up to see if that person exists.

The same holds true for address information. It will attempt standardize the address, locate it (see if it really exists) and then return the standardized address back to you.

In both of these scenarios, no hits can be returned, which means that the person is not found or the addresss is not found. In those cases you probably have a fake individual or address like the one in your sample data.

Also, sometimes it might return multiple matches or less than 100% matches. For example, if the only piece of valid information is a phone number, it might return back multiple people at that phone number.

For less than absolute matches you might set a confidence factor which you want to match on, like 90%. Anything below that number goes into a report which will require manual intervention. People who don't match should be eliminated from your datastore as it is junk data.

1
  • There are 7 million people to check daily. Therefore I believe I have to reinvent the wheel.
    – w0051977
    CommentedDec 11, 2014 at 23:05

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.