I have a similarity matrix between N objects. For each pair of objects, I have a measure of how similar they are to each other: 0 means identical (the main diagonal), and the value grows as they get less and less similar. Something like this (the actual matrix would be 10 000 x 10 000 if I have, say, 10 000 elements to cluster):
```
[  0   5  12   7 ]
[  -   0   9   2 ]
[  -   -   0   6 ]
[  -   -   -   0 ]
```
(Assuming indices start at 0.) So here object_0 would be most similar to object_1 (5), but object_1 itself would be most similar to object_3 (2), etc.
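For what it's worth, here is that toy example as a full symmetric numpy array (which I assume is the format a library would typically want, rather than just the upper triangle):

```python
import numpy as np

# Toy 4x4 dissimilarity matrix from above, with the lower triangle mirrored in
# so it is fully symmetric (0 on the diagonal, larger = less similar).
D = np.array([
    [ 0.,  5., 12.,  7.],
    [ 5.,  0.,  9.,  2.],
    [12.,  9.,  0.,  6.],
    [ 7.,  2.,  6.,  0.],
])
```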
I would like to cluster these into k groups, so as to minimize either the overall score or the intra-cluster score. I'm not sure that would make much of a practical difference. I haven't checked the maths on that, but I suspect there could be cases where minimizing a specific cluster's score wouldn't necessarily minimize the overall score (across all clusters). At the end of the day, even though rigorously speaking there may be a difference, I probably don't care in practice if the results are close enough either way.
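To make "score" concrete, here is roughly what I mean by the overall score of an assignment: the sum, over all clusters, of the pairwise dissimilarities between objects placed in the same cluster (the function name is just mine, for illustration):

```python
import numpy as np

def total_intra_cluster_score(D, labels):
    """Sum, over all clusters, of the pairwise dissimilarities between
    objects assigned to the same cluster.

    D      : (N, N) symmetric dissimilarity matrix, zeros on the diagonal.
    labels : length-N array giving each object's cluster id.
    """
    labels = np.asarray(labels)
    score = 0.0
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        block = D[np.ix_(members, members)]
        score += block.sum() / 2.0  # symmetric matrix counts each pair twice
    return score
```

For example, on the toy matrix above, total_intra_cluster_score(D, [0, 0, 1, 1]) would be 5 + 6 = 11.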
A few details:
- Clusters don't have to be of equal size
- I don't really mind whether the number of clusters is an input or decided by the algo itself in some fashion (though I would prefer inputting it)
- Typically I would have about 10 000 elements to cluster, maybe 40 000 if I push it. But then I may need to repeatedly run it for a bunch of matrices of that size.
On top of an algo/lib that does that, bonus points for:
- Some library that's already implemented it, so I can hopefully just format the data properly for it and see whether it seems to give good results before I sink a whole lot of time into it (there's a rough sketch of the kind of call I have in mind at the end)
- For such a lib, support for parallel processing
- Other nice bells & whistles (like maximum number of iterations, ability to input some stopping criteria, etc.)
Failing that, of course it could also just be some pseudo-code that allows me to cluster them into groups.
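To illustrate the kind of usage I'm hoping for (not claiming this is the right algorithm, just the rough shape of the call), something along the lines of scikit-learn's AgglomerativeClustering, which as far as I can tell accepts a precomputed dissimilarity matrix:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_precomputed(D, k):
    """Cluster N objects into k groups given an (N, N) dissimilarity matrix D."""
    model = AgglomerativeClustering(
        n_clusters=k,
        metric="precomputed",  # sklearn >= 1.2; older versions spelled this affinity="precomputed"
        linkage="average",     # "ward" isn't allowed with precomputed distances
    )
    return model.fit_predict(D)  # length-N array of cluster labels

# labels = cluster_precomputed(D, 2)  # on the toy matrix above: a length-4 label array
```

From what I can tell, scipy.cluster.hierarchy (linkage on a condensed distance matrix, then fcluster) would be another option of roughly the same shape, but I haven't dug into either enough to know how they behave at 10 000+ elements.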