
I have a similarity matrix between N objects (strictly speaking, a dissimilarity matrix): for each pair of objects, I have a measure of how similar they are, with 0 meaning identical (the main diagonal) and increasing values as the objects get less and less similar. Something like this (the actual matrix would be 10 000 x 10 000 if I have, say, 10 000 elements to cluster):

$$\begin{pmatrix} 0 & 5 & 12 & 7 \\ - & 0 & 9 & 2 \\ - & - & 0 & 6 \\ - & - & - & 0 \end{pmatrix}$$

(Assuming indices start at 0.) Here object_0 is most similar to object_1 (distance 5), but object_1 itself is most similar to object_3 (distance 2), and so on.
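
For concreteness, here is a minimal sketch (using NumPy, and assuming only the upper triangle is stored) of how I would expand such a matrix into the full symmetric form most libraries expect:

```python
import numpy as np

# Upper-triangular dissimilarities as in the example above,
# with zeros below the diagonal.
D = np.array([
    [0, 5, 12, 7],
    [0, 0,  9, 2],
    [0, 0,  0, 6],
    [0, 0,  0, 0],
], dtype=float)

# Mirror the upper triangle; the diagonal stays zero.
D = D + D.T

print(D[1])  # distances from object_1: [5. 0. 9. 2.]
```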

I would like to cluster these into k groups so as to minimize either the overall score or the intra-cluster scores; I'm not sure that distinction makes much practical difference. I haven't checked the maths, but I suspect there are cases where minimizing a specific cluster's score would not minimize the overall score across all clusters. Even though there may rigorously be a difference, practically speaking I may not care if the results are close enough either way.
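
One way to write down the objective (just one possible formalization on my part): with $d_{ij}$ the dissimilarity between objects $i$ and $j$, find clusters $C_1, \dots, C_k$ that minimize the total intra-cluster dissimilarity,

$$\min_{C_1, \dots, C_k} \sum_{m=1}^{k} \sum_{\substack{i, j \in C_m \\ i < j}} d_{ij},$$

or a normalized variant dividing each inner sum by $|C_m|$, so that large clusters are not penalized merely for their size.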

A few details:

  • Clusters don't have to be equal sizes
  • I don't really mind whether the number of clusters is an input or is decided by the algorithm itself in some fashion (though I would prefer to input it)
  • Typically I would have about 10 000 elements to cluster, maybe 40 000 if I push it. I may then need to run it repeatedly for a bunch of matrices of that size.

On top of an algo/lib that does that, bonus points for:

  • A library that has already implemented it, so I can just format the data properly for it and see whether it gives good results before I sink a whole lot of time into it
  • For such a lib, support for parallel processing
  • Other nice bells & whistles (like maximum number of iterations, ability to input some stopping criteria, etc.)

Failing that, of course it could also just be some pseudo-code that allows me to cluster them into groups.


3 Answers


Within the documentation for HDBSCAN (Hierarchical DBSCAN), there is a really nice comparison of clustering algorithms. It is a bit biased, highlighting its own strengths (of course), but it will still give you examples and some boilerplate code to get up and running quickly. DBSCAN and HDBSCAN are generally known to struggle with high variance in the clusters. If that ends up being important for your use case, you might want to look at OPTICS, which handles it better.
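
As a rough sketch of what that boilerplate looks like with a precomputed matrix (using the hdbscan package's precomputed-distance mode; min_cluster_size is a placeholder value to tune):

```python
import numpy as np
import hdbscan

# Toy symmetric dissimilarity matrix with a zero diagonal
# (stand-in for your 10 000 x 10 000 matrix).
rng = np.random.default_rng(0)
D = rng.random((100, 100))
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)

# metric='precomputed' tells HDBSCAN the input is already a
# pairwise dissimilarity matrix, not a set of feature vectors.
clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=5)
labels = clusterer.fit_predict(D)  # label -1 marks noise points
```

Note that HDBSCAN picks the number of clusters itself (and may label points as noise), so it does not directly match your "prefer to input k" wish.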

Alternatively, taking a step back, you could try computing distances between

There is also another library called pyclustering, which contains many algorithms as well as a set of examples. The algorithms are mainly implemented in C++ under the hood, so they are generally much faster than the versions in more standard Python libraries.
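
For example, here is a sketch with pyclustering's k-medoids in distance-matrix mode (argument names as per its docs; treat the exact call as an assumption to verify against the library's examples):

```python
import random
import numpy as np
from pyclustering.cluster.kmedoids import kmedoids

# The example matrix from the question, expanded to full symmetric form.
D = np.array([[ 0, 5, 12, 7],
              [ 5, 0,  9, 2],
              [12, 9,  0, 6],
              [ 7, 2,  6, 0]], dtype=float)

k = 2
initial_medoids = random.sample(range(len(D)), k)

# data_type='distance_matrix' makes pyclustering treat the input as
# pairwise dissimilarities instead of coordinate points.
km = kmedoids(D.tolist(), initial_medoids, data_type='distance_matrix')
km.process()
print(km.get_clusters())  # one list of row indices per cluster
```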

Comment from Veliko (Nov 17, 2020 at 8:59): The last two links do not work anymore.

There are hundreds of algorithms to choose from:

  • Hierarchical clustering in its myriad variants; cut the dendrogram as desired, e.g., to get k clusters (see the sketch after this list)
  • PAM (k-medoids), the closest match to k-means on a distance matrix (it minimizes the total distance to each cluster's medoid)
  • Spectral clustering
  • DBSCAN
  • OPTICS
  • HDBSCAN*
  • Affinity Propagation
  • ...
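
To illustrate the first option, a minimal sketch with SciPy (hierarchical clustering on a precomputed matrix, then cutting the dendrogram into exactly k flat clusters; the linkage method is a placeholder choice):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Full symmetric dissimilarity matrix with a zero diagonal
# (the example from the question).
D = np.array([[ 0, 5, 12, 7],
              [ 5, 0,  9, 2],
              [12, 9,  0, 6],
              [ 7, 2,  6, 0]], dtype=float)

# linkage() expects the condensed upper-triangle form.
Z = linkage(squareform(D), method='average')

# Cut the dendrogram to get exactly k flat clusters.
k = 2
labels = fcluster(Z, t=k, criterion='maxclust')  # labels in 1..k
print(labels)
```

Mind the memory, though: the condensed form of a 40 000 x 40 000 matrix has about 8 * 10^8 entries, roughly 6.4 GB as float64, so at that scale the matrix itself may be the bottleneck.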

Other than what has been mentioned in the previous answers, take a look at the ddCRP (distance-dependent Chinese Restaurant Process), which does not require the number of clusters as an input and provides thresholds and criteria for steering it towards the desired clustering.
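
For intuition, here is a sketch of a single draw from the ddCRP prior (my own illustration of the generative process from Blei & Frazier's formulation; actual clustering requires posterior inference, e.g. Gibbs sampling, and the exponential decay and alpha below are placeholder choices):

```python
import numpy as np

def ddcrp_prior_draw(D, alpha=1.0, rng=None):
    """One draw from the ddCRP prior: object i links to object j with
    probability proportional to exp(-D[i, j]), or to itself with
    probability proportional to alpha; clusters are the connected
    components of the resulting link graph."""
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    links = np.empty(n, dtype=int)
    for i in range(n):
        w = np.exp(-D[i].astype(float))  # placeholder decay function
        w[i] = alpha                     # weight of starting a new table
        links[i] = rng.choice(n, p=w / w.sum())

    # Union-find to extract connected components as cluster labels.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        parent[find(i)] = find(links[i])
    roots = [find(i) for i in range(n)]
    return np.unique(roots, return_inverse=True)[1]  # 0-based labels

# Example: one draw over the question's 4x4 matrix.
D = np.array([[ 0, 5, 12, 7],
              [ 5, 0,  9, 2],
              [12, 9,  0, 6],
              [ 7, 2,  6, 0]], dtype=float)
print(ddcrp_prior_draw(D, alpha=1.0, rng=0))
```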
