Here's the data I have:
- Text from articles on various music blogs and music news sites (title, summary, full content, and sometimes tags).
- The proper nouns in that text. I used a few different NLP/NER tools (NLTK, spaCy, and Stanford NER) to find them, and gave each proper noun a score based on how many times it appeared and how many of the tools recognized it as a proper noun. None of these tools is very accurate on its own for my data.
- For each proper noun, the MusicBrainz artists with that name. (MusicBrainz has a lot of data that may be helpful: aliases, discographies, and associations with other artists.)
- Any links in the article to Spotify, YouTube, etc., along with the song name and artist for each link.
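To make the scoring concrete, here is a minimal sketch of the kind of score described above. The exact formula is an assumption for illustration (mention count multiplied by the fraction of tools that agree), and `score_proper_nouns` and the input layout are hypothetical names, not part of NLTK, spaCy, or Stanford NER.

```python
from collections import Counter, defaultdict


def score_proper_nouns(tool_outputs):
    """Score proper nouns for one article.

    tool_outputs: dict mapping a tool name to the list of proper-noun
    mentions that tool extracted (one list entry per mention).
    Assumed formula: mention count * fraction of tools that found the noun.
    """
    counts = Counter()              # best mention count seen for each noun
    tools_seen = defaultdict(set)   # which tools recognized each noun

    for tool, mentions in tool_outputs.items():
        per_tool = Counter(m.strip().lower() for m in mentions)
        for noun, n in per_tool.items():
            # Take the max across tools so one mention is not counted three times.
            counts[noun] = max(counts[noun], n)
            tools_seen[noun].add(tool)

    n_tools = len(tool_outputs)
    return {
        noun: counts[noun] * len(tools_seen[noun]) / n_tools
        for noun in counts
    }


# Made-up tool output for a single article, for illustration only.
outputs = {
    "nltk": ["Radiohead", "Radiohead", "Monday"],
    "spacy": ["Radiohead", "Thom Yorke"],
    "stanford": ["Radiohead", "Radiohead", "Thom Yorke"],
}
scores = score_proper_nouns(outputs)
for noun, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{noun}: {score:.2f}")
```

A noun found often and by every tool ranks highest, while something like a weekday that only one tool flags ends up near the bottom.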
I have three goals:
- Determine which proper nouns are artists
- For artists that share the same name, determine which one the text is referring to (based on MusicBrainz data)
- Determine if the artist is important to the article, or if they were just briefly mentioned
I have manually tagged some of the data with the correct output for all three goals.
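For reference, one way a single tagged example could be represented is sketched below. The class and field names are hypothetical and only meant to show what "correct output for the three goals" covers; this is not a fixed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaggedNoun:
    """One manually tagged proper noun from one article (hypothetical layout)."""
    article_id: str                # which article the noun came from
    noun: str                      # the proper noun as it appears in the text
    is_artist: bool                # goal 1: is this noun an artist?
    musicbrainz_id: Optional[str]  # goal 2: which MusicBrainz artist, if any
    is_important: Optional[bool]   # goal 3: central to the article or a passing mention


# A non-artist example: goals 2 and 3 do not apply, so they stay None.
example = TaggedNoun(
    article_id="article-001",
    noun="Monday",
    is_artist=False,
    musicbrainz_id=None,
    is_important=None,
)
print(example)
```

Keeping the three labels in one record would allow either training one model per goal or chaining them (artist detection first, then disambiguation, then importance).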
How would you go about this? Which algorithms do you think would be best for these goals?
Is there a semi-supervised learning approach I could use to reduce the amount of manual tagging I need to do?