Clustering Results Repository (v1.1.0)

We have also prepared a repository of clustering results for problems from our Benchmark Suite (v1.1.0).

A non-interactive results catalogue is available here.

Methods

Currently, the outputs of the following methods are included. Where applicable, we considered a wide range of control parameters.

  • k-means, Gaussian mixtures, spectral, and Birch available in sklearn 0.23.1 (Python) [B+13, P+11];

  • hierarchical agglomerative methods with the average, centroid, complete, median, Ward, and weighted/McQuitty linkages implemented in fastcluster 1.1.26 (Python/R) [Mul13];

  • genieclust 1.0.0 (Python/R) [Gag21, GBC16] (note that Genie with g=1.0 is equivalent to the single linkage algorithm);

  • ITM git commit 178fd43 (Python) [MNL12] – an “information-theoretic” algorithm based on minimum spanning trees;

  • optim_cvi – local maxima of many internal cluster validity measures, including the Caliński–Harabasz, Dunn, and Silhouette indices (great effort was made to maximise the probability that these maxima are high-quality ones); see [GBC21].

New results will be added in the future (note that we can only consider methods that allow for setting the precise number of generated clusters). Quality contributions are welcome.
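
As an illustration of what "setting the precise number of generated clusters" means in practice, the following minimal sketch cuts a hierarchical clustering into exactly k groups. It is not the code used to build the repository: scipy's linkage function stands in for fastcluster here (an assumption made to keep the example self-contained), and the helper name cut_into_k as well as the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cut_into_k(X, k, method="average"):
    """Agglomerative clustering of the rows of X, cut into at most k clusters."""
    Z = linkage(X, method=method)  # Euclidean distance is the default metric
    # "maxclust" picks the cut height that yields no more than k flat clusters;
    # for monotone linkages and distinct merge heights this gives exactly k.
    return fcluster(Z, t=k, criterion="maxclust")


# Toy data (hypothetical): three well-separated Gaussian blobs, 30 points each.
rng = np.random.default_rng(42)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

# Some of the linkages listed above; labels are integers starting from 1.
methods = ("average", "complete", "ward", "weighted")
labels = {m: cut_into_k(X, k=3, method=m) for m in methods}
```

Each entry of labels is a vector of 90 cluster identifiers. Methods that choose the number of clusters on their own offer no counterpart to the k argument, which is why they cannot be included.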

Feature Engineering

The algorithms operated on the original data spaces, i.e., subject to only some mild preprocessing:

  • columns of zero variance have been removed;

  • a tiny amount of white noise has been added to each datum to make sure the distance matrices consist of unique elements (this guarantees that the results of hierarchical clustering algorithms are unambiguous);

  • all data were translated and scaled proportionally (i.e., by a single common factor) to ensure that each column has mean 0 and that the total variance of all entries is 1 (this is not standardisation, as the columns are not rescaled individually).
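
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact routine used to produce the results: the function name preprocess, the noise magnitude noise_scale, and the seed are hypothetical, and "total variance" is taken to mean the variance of all entries of the centred matrix pooled together.

```python
import numpy as np


def preprocess(X, noise_scale=1e-6, seed=123):
    """Mild preprocessing: drop constant columns, jitter, centre, scale globally."""
    X = np.asarray(X, dtype=float)
    # 1. Remove columns of zero variance.
    X = X[:, X.var(axis=0) > 0]
    # 2. Add a tiny amount of white noise (the magnitude is an assumption),
    #    so that ties between pairwise distances become very unlikely.
    rng = np.random.default_rng(seed)
    X = X + rng.normal(0.0, noise_scale * X.std(), size=X.shape)
    # 3. Translate each column to mean 0, then divide everything by one
    #    common factor so that the variance of all entries equals 1.
    X = X - X.mean(axis=0)
    return X / X.std()


# Hypothetical toy input; the third column is constant and gets removed.
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 30.0, 5.0],
              [4.0, 20.0, 5.0],
              [3.0, 60.0, 5.0]])
Z = preprocess(X)
```

Because a single scaling factor is applied to the whole matrix, the relative spreads of the columns are preserved: in Z, the second column still has a much larger variance than the first, whereas standardisation would force both to 1.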

Note, however, that spectral clustering and Gaussian mixtures can be regarded as methods that modify the input data space.

Overall, comparisons between distance-based methods that apply automated feature engineering/selection and those that only operate on raw inputs are not exactly fair. In such settings, the classical methods should be run on the transformed data spaces as well.