# True vs Predicted Clusters

## Data

Let $$X=\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$$ be the input dataset that consists of $$n$$ objects.

As an illustration, in this section, we consider the wut/x2 dataset, which features 120 points in $$\mathbb{R}^2$$.

import clustbench
data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
X = benchmark.data
X[:5, :]  # preview
## array([[-0.16911746, -0.25135197],
##        [-0.17667309, -0.48302249],
##        [-0.1423283 , -0.36194485],
##        [-0.10637601, -0.70068834],
##        [-0.49353012, -0.23587515]])


Note

The Python code we give here should be self-explanatory even to the readers who do not know this language (yet1). Also note that in a further section, we explain how to use our benchmark framework from within R and MATLAB.

## Reference Labels

Each dataset like the one above is equipped with a reference2 partition assigned by experts, which represents a grouping of the points into $$k \ge 2$$ clusters. Or, more formally, a k-partition3 $$\{X_1,\dots,X_k\}$$ of $$X$$.

(y_true := benchmark.labels[0])  # the first label vector
## array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
##        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,
##        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
##        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
##        1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
##        3, 3, 3, 3, 3, 3, 3, 3, 3, 3])


More precisely, what we have here is a label vector $$\mathbf{y}$$, where $$y_i\in\{1,\dots,k\}$$ gives the subset/cluster number (ID) of the i-th object.

The number of subsets, k, is thus an inherent part of the reference set:

(k := max(y_true))  # or benchmark.n_clusters[0]
## 3


## Predicted Labels

Let us take any clustering algorithm whose quality we would like to assess. As an example, we will consider the outputs generated by Genie. We need to apply it on $$X$$ to discover a new k-partition:

import genieclust
g = genieclust.Genie(n_clusters=k)  # using default parameters
(y_pred := g.fit_predict(X) + 1)  # +1 makes cluster IDs in 1..k, not 0..(k-1)
## array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
##        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,
##        2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 2, 2, 2, 2, 2, 1,
##        2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2,
##        2, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
##        3, 3, 3, 3, 3, 3, 3, 3, 3, 3])


We thus obtained a vector of predicted labels encoding a new grouping, $$\{\hat{X}_1,\dots,\hat{X}_k\}$$. We will denote it with $$\hat{\mathbf{y}}$$.

The figure below depicts the two partitions side by side.

plt.subplot(1, 2, 1)
genieclust.plots.plot_scatter(X, labels=y_true-1, axis="equal", title="y_true")
plt.subplot(1, 2, 2)
genieclust.plots.plot_scatter(X, labels=y_pred-1, axis="equal", title="y_pred")
plt.show()


Important

Ideally, we would like to work with algorithms that yield partitions closely matching the reference ones. This should be true on as wide a set of problems as possible.

We, therefore, need to relate the predicted labels to the reference ones. Note that the automated discovery of $$\hat{\mathbf{y}}$$ never relies on the information included in $$\mathbf{y}$$. In other words, clustering is an unsupervised4 learning process.

## Permuting Cluster IDs

A partition is a set of point groups, and sets are, by definition, unordered. Therefore, the actual cluster IDs do not really matter.

Looking at the above figure, we see that the red (ID=2), black (ID=1), and green (ID=3) discovered clusters can be paired with, respectively, the black (ID=1), red (ID=2), and green (ID=3) reference ones.

We can easily recode5 y_pred so that the cluster IDs nicely correspond to each other:

import numpy as np
o = np.array([2, 1, 3])  # relabelling: 1→2, 2→1, 3→3
(y_pred := o[y_pred-1])  # Python uses 0-based indexing, hence the -1
## array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
##        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2,
##        1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 3, 1, 1, 1, 1, 1, 2,
##        1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1,
##        1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
##        3, 3, 3, 3, 3, 3, 3, 3, 3, 3])


See below for an updated figure.

plt.subplot(1, 2, 1)
genieclust.plots.plot_scatter(X, labels=y_true-1, axis="equal", title="y_true")
plt.subplot(1, 2, 2)
genieclust.plots.plot_scatter(X, labels=y_pred-1, axis="equal", title="y_pred")
plt.show()


## Confusion Matrix

We can determine the confusion matrix $$\mathbf{C}$$ for the two label vectors:

genieclust.compare_partitions.confusion_matrix(y_true, y_pred)
## array([[37, 12,  1],
##        [ 0, 40,  0],
##        [ 0,  0, 30]])


Here, $$c_{i, j}$$ denotes the number of points in the $$i$$-th reference cluster that the algorithm assigned to the $$j$$-th cluster.

Overall, Genie returned a clustering quite similar to the reference one. We can consider 107 ($$c_{1,1}+c_{2, 2}+c_{3,3}=37+40+30$$) out of the 120 input points as correctly grouped. In particular, each of the red and green reference points (the 2nd and the 3rd row) has been properly discovered.

However, the algorithm has miscategorised (at least, as far as the current set of expert labels is concerned) one green and twelve red points. They should have been coloured black.

## External Cluster Validity Measures

A confusion matrix summarises all the information required to judge the similarity between the two partitions. However, if we wish to compare the quality of different algorithms, we would rather have it summarised in the form of a single number.

The adjusted asymmetric accuracy [Gag22b] is an external cluster validity measure (see the Appendix for more details) that quantifies the degree of agreement between the reference $$\mathbf{y}$$ and the predicted $$\hat{\mathbf{y}}$$. It is given by:

$\mathrm{AAA}(\mathbf{C}) = \frac{ \max_\sigma \frac{1}{k} \sum_{i=1}^k \frac{c_{i, \sigma(i)}}{c_{i, 1}+\dots+c_{i, k}} - \frac{1}{k} }{ 1 - \frac{1}{k} }.$

Note that the matching between the cluster labels is performed automatically by finding the best permutation $$\sigma$$ of the set $$\{1,\dots,k\}$$.

Equivalently:

$\mathrm{AAA}(\mathbf{C}) = 1- \min_\sigma \left( \frac{1}{k} \sum_{i=1}^k \frac{c_{i, 1}+\dots+c_{i, k}-c_{i, \sigma(i)}}{ \frac{k-1}{k} (c_{i, 1}+\dots+c_{i, k}) } \right),$

which can be thought of as a measure of the average proportion of correctly identified points in each cluster (“above” the random assignment).

Overall, the more similar the partitions, the greater the similarity score (1 if they are identical). AAA is adjusted for chance and cluster size inequality. In particular, its expected value is equal to 0 if the partitions were picked at random or all the points were put into a single large cluster. In our case:

genieclust.compare_partitions.adjusted_asymmetric_accuracy(y_true, y_pred)
## 0.8700000000000001


or, equivalently:

clustbench.get_score(y_true, y_pred)  # the AAA metric is used by default
## 0.8700000000000001


indicates a decent degree of similarity between the reference and the discovered partitions.

Following [Gag22b], we recommend the use of AAA as an external cluster validity measure. For the sake of completeness, other popular partition similarity scores are discussed in the Appendix. In particular, this is where we argue why the classic (classification) accuracy rate should not be used in our context (this is a common mistake!).

1

2

Later we will note that there may be many possible ways to split a dataset into groups.

3

We say that $$\{X_1,\dots,X_k\}$$ is a $$k$$-partition of $$X$$, whenever $$\bigcup_{i=1}^k X_i=X$$, each $$X_i$$ is nonempty, and the subsets are pairwise disjoint, i.e., $$X_i\cap X_j=\emptyset$$ for $$i\neq j$$.

4

Tasks such as semi-supervised clustering, where the right assignment of some of the input points is known in advance, are not of our interest here. However, the current framework can trivially be adjusted to fit such scenarios.

5

In a later section, we discuss a way to automatically discover an optimal relabelling based on the solution to the maximal assignment problem.