I have a huge dataset of Indian residents (last name, first name, date of birth), and I need to match the records by similarity.
The matching is fuzzy. The data looks like this (names are fictitious for the example):
last name, first name, date of birth
John;Doe;01-01-2003
Doe;John;01-01-2003
John Doe;;01-01-2003
I've had some success with the comparison in principle, using Levenshtein distance.
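A comparison of this kind might be sketched as below. It is only a minimal illustration under assumptions: the helper names `levenshtein`, `normalize`, and `name_similarity` are hypothetical, and lowercasing and sorting the name tokens before comparing (so that swapped or merged fields line up) is one possible normalization, not necessarily the one in use here.

```python
from typing import Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(record: str) -> Tuple[str, str]:
    """Split 'last;first;dob' and return (sorted lowercase name tokens, dob).

    Sorting the tokens is an assumption made for this example so that
    swapped or merged name fields produce the same string.
    """
    last, first, dob = record.split(";")
    tokens = sorted((last + " " + first).lower().split())
    return " ".join(tokens), dob


def name_similarity(r1: str, r2: str) -> float:
    """Similarity in [0, 1] based only on the normalized name strings."""
    n1, _ = normalize(r1)
    n2, _ = normalize(r2)
    longest = max(len(n1), len(n2)) or 1  # avoid division by zero
    return 1.0 - levenshtein(n1, n2) / longest


if __name__ == "__main__":
    records = ["John;Doe;01-01-2003", "Doe;John;01-01-2003", "John Doe;;01-01-2003"]
    for other in records[1:]:
        print(records[0], "vs", other, "->", name_similarity(records[0], other))
```

The date of birth is returned by `normalize` but deliberately left out of the score here; how to weight it against the name is a separate decision.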
Now the question of how to encode the data for a neural network has come up. The dataset is large and I plan to use embeddings, but I don't have a dictionary of names.
What should I do in that case? Is there another way to encode the names?