# Benchmark Suite (v1.1.0)

We have compiled a large suite of benchmark datasets. For reproducibility, the datasets and label vectors are versioned.

Version 1.1.0 of the Benchmark Suite for Clustering Algorithms consists of nine benchmark batteries (dataset collections). The five recommended ones are:

1. wut 🌟,

2. sipu 🌟,

3. fcps 🌟,

4. graves 🌟,

5. other 🌟.

The remaining four batteries are:

1. uci,

2. mnist,

3. g2mg,

4. h2mg.

Each battery consists of several datasets of different origins. When referring to a particular benchmark problem, we use the convention “battery/dataset”, e.g., “wut/x2”. Each dataset represents n points in $$\mathbb{R}^d$$ and is accompanied by at least one reference partition of cardinality k (a listing follows). The distribution of cluster sizes is summarised below by means of the Gini index g, where $$g=0$$ means that all clusters consist of the same number of points.
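
As an illustration, the Gini index of a vector of cluster sizes can be computed as the normalised mean absolute difference; this is a standard definition and is assumed here (the suite's exact normalisation is not restated on this page):

```python
import numpy as np

def gini(sizes):
    """Gini index of a vector of cluster sizes.

    Normalised mean absolute difference: 0 means all clusters
    are equally large; values close to 1 indicate strong imbalance.
    """
    x = np.asarray(sizes, dtype=float)
    n = len(x)
    # sum of absolute differences over all ordered pairs (i, j)
    mad = np.abs(x[:, None] - x[None, :]).sum()
    return mad / (2.0 * n * x.sum())

print(gini([100, 100, 100]))       # equal sizes -> 0.0
print(gini([100, 100, 100, 900]))  # imbalanced  -> 0.5
```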

**Important**

The versioned snapshots of the suite are available for download at: https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0.

All files can be browsed at: https://github.com/gagolews/clustering-data-v1/tree/v1.1.0.
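
Assuming the repository's file layout (a `battery/dataset.data.gz` data matrix plus `battery/dataset.labelsN.gz` label vectors, which is how the files appear in the repository browser), a raw-download URL for a given benchmark problem can be put together as follows; the helper name is ours:

```python
BASE = "https://raw.githubusercontent.com/gagolews/clustering-data-v1"
TAG = "v1.1.0"

def data_url(problem, labels=None):
    """Raw-download URL for a 'battery/dataset' benchmark problem.

    With labels=None the data matrix is addressed; otherwise the
    given reference label vector (e.g., labels=0 for 'labels0').
    Assumes the *.data.gz / *.labelsN.gz naming convention.
    """
    battery, dataset = problem.split("/")
    suffix = "data" if labels is None else f"labels{labels}"
    return f"{BASE}/{TAG}/{battery}/{dataset}.{suffix}.gz"

print(data_url("wut/x2"))            # .../v1.1.0/wut/x2.data.gz
print(data_url("wut/x2", labels=0))  # .../v1.1.0/wut/x2.labels0.gz
```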

The datasets and the corresponding ground truth labels can be browsed in the Explore Datasets (v1.1.0) section.

For the results generated by different clustering algorithms, see the Clustering Results Repository (v1.1.0) section.

See https://genieclust.gagolewski.com/ and [22, 26] for example studies featuring different versions of this suite.

The datasets are provided solely for research purposes, unless stated otherwise. As mentioned in the File Format Specification section, each dataset is accompanied by a text file specifying more details thereon (e.g., the literature references that we are asked to cite).

As a courtesy, please cite the current project [21] as well as [27], and mention the exact version and URL of the dataset suite. Thank you.

There is some inherent overlap between the original databases. We have tried to resolve any conflicts in the best possible manner. Some datasets are equipped with additional reference labellings that did not appear in the original setting.

## wut

22 datasets in $$\mathbb{R}^2$$ or $$\mathbb{R}^3$$ authored by the fantastic students of Marek’s 2016/2017 courses on Data Analysis in R and Python at the Faculty of Mathematics and Information Science, Warsaw University of Technology: Anna Gierlak, Eliza Kaczorek, Mateusz Kobyłka, Przemysław Kosewski, Jędrzej Krauze, Michał Maciąg, Aleksander Truszczyński, and Adam Wawrzeńczyk. Thanks!

| #  | dataset        | n     | d | reference labels | k  | noise points | g    |
|----|----------------|-------|---|------------------|----|--------------|------|
| 1  | circles        | 4000  | 2 | labels0          | 4  | 0            | 0    |
| 2  | cross          | 2000  | 2 | labels0          | 4  | 0            | 0    |
| 3  | graph          | 2500  | 2 | labels0          | 10 | 0            | 0    |
| 4  | isolation      | 9000  | 2 | labels0          | 3  | 0            | 0    |
| 5  | labirynth      | 3546  | 2 | labels0          | 6  | 0            | 0.5  |
| 6  | mk1            | 300   | 2 | labels0          | 3  | 0            | 0    |
| 7  | mk2            | 1000  | 2 | labels0          | 2  | 0            | 0    |
| 8  | mk3            | 600   | 3 | labels0          | 3  | 0            | 0    |
| 9  | mk4            | 1500  | 3 | labels0          | 3  | 0            | 0    |
| 10 | olympic        | 5000  | 2 | labels0          | 5  | 0            | 0    |
| 11 | smile          | 1000  | 2 | labels0          | 6  | 0            | 0.4  |
|    |                |       |   | labels1          | 4  | 0            | 0.4  |
| 12 | stripes        | 5000  | 2 | labels0          | 2  | 0            | 0    |
| 13 | trajectories   | 10000 | 2 | labels0          | 4  | 0            | 0    |
| 14 | trapped_lovers | 5000  | 3 | labels0          | 3  | 0            | 0.4  |
| 15 | twosplashes    | 400   | 2 | labels0          | 2  | 0            | 0    |
| 16 | windows        | 2977  | 2 | labels0          | 5  | 0            | 0.58 |
| 17 | x1             | 120   | 2 | labels0          | 3  | 0            | 0.17 |
| 18 | x2             | 120   | 2 | labels0          | 3  | 0            | 0.17 |
|    |                |       |   | labels1          | 4  | 10           | 0.35 |
| 19 | x3             | 185   | 2 | labels0          | 4  | 0            | 0.21 |
|    |                |       |   | labels1          | 3  | 0            | 0.42 |
| 20 | z1             | 192   | 2 | labels0          | 3  | 0            | 0    |
| 21 | z2             | 900   | 2 | labels0          | 5  | 0            | 0.58 |
| 22 | z3             | 1000  | 2 | labels0          | 4  | 0            | 0.33 |

## sipu

An excellent battery of 20 diverse datasets created/compiled/maintained by P. Fränti and his colleagues and research students from the University of Eastern Finland. Available for download from https://cs.joensuu.fi/sipu/datasets/; see [17] for discussion:

• a1, a2, a3 [36],

• aggregation [28],

• birch1, birch2 [62],

• compound [61],

• d31, r15 [54],

• flame [19],

• jain [32],

• pathbased, spiral [8],

• s1, s2, s3, s4 [18],

• unbalance [46],

• worms_2, worms_64 [48].

We have not included the G2 sets as we suggest the cluster variances should be corrected for space dimensionality; see g2mg for an alternative. Birch3 is not included as no ground-truth labels were provided. We excluded the DIM-sets as they are too easy for most algorithms.

| #  | dataset     | n      | d  | reference labels | k   | noise points | g    |
|----|-------------|--------|----|------------------|-----|--------------|------|
| 1  | a1          | 3000   | 2  | labels0          | 20  | 0            | 0    |
| 2  | a2          | 5250   | 2  | labels0          | 35  | 0            | 0    |
| 3  | a3          | 7500   | 2  | labels0          | 50  | 0            | 0    |
| 4  | aggregation | 788    | 2  | labels0          | 7   | 0            | 0.45 |
| 5  | birch1      | 100000 | 2  | labels0          | 100 | 0            | 0.01 |
| 6  | birch2      | 100000 | 2  | labels0          | 100 | 0            | 0    |
| 7  | compound    | 399    | 2  | labels0          | 6   | 0            | 0.44 |
|    |             |        |    | labels1          | 4   | 0            | 0.41 |
|    |             |        |    | labels2          | 5   | 50           | 0.48 |
|    |             |        |    | labels3          | 4   | 50           | 0.42 |
|    |             |        |    | labels4          | 5   | 0            | 0.48 |
| 8  | d31         | 3100   | 2  | labels0          | 31  | 0            | 0    |
| 9  | flame       | 240    | 2  | labels0          | 2   | 0            | 0.28 |
|    |             |        |    | labels1          | 2   | 12           | 0.27 |
| 10 | jain        | 373    | 2  | labels0          | 2   | 0            | 0.48 |
| 11 | pathbased   | 300    | 2  | labels0          | 3   | 0            | 0.06 |
|    |             |        |    | labels1          | 4   | 0            | 0.2  |
| 12 | r15         | 600    | 2  | labels0          | 15  | 0            | 0    |
|    |             |        |    | labels1          | 9   | 0            | 0.4  |
|    |             |        |    | labels2          | 8   | 0            | 0.47 |
| 13 | s1          | 5000   | 2  | labels0          | 15  | 0            | 0.03 |
| 14 | s2          | 5000   | 2  | labels0          | 15  | 0            | 0.03 |
| 15 | s3          | 5000   | 2  | labels0          | 15  | 0            | 0.03 |
| 16 | s4          | 5000   | 2  | labels0          | 15  | 0            | 0.03 |
| 17 | spiral      | 312    | 2  | labels0          | 3   | 0            | 0.02 |
| 18 | unbalance   | 6500   | 2  | labels0          | 8   | 0            | 0.63 |
| 19 | worms_2     | 105600 | 2  | labels0          | 35  | 0            | 0.28 |
| 20 | worms_64    | 105000 | 64 | labels0          | 25  | 0            | 0    |

## fcps

9 datasets from the Fundamental Clustering Problem Suite proposed by A. Ultsch [53] of the University of Marburg, Germany.

Each dataset consists of 212–4096 observations in 2–3 dimensions. The GolfBall dataset is not included as it has no cluster structure. Tetragonula and Leukemia are also not part of our suite as they are given as distance matrices.

The data were originally distributed elsewhere, but can now be accessed, e.g., via the R package FCPS; see also [50].

| # | dataset     | n    | d | reference labels | k | noise points | g    |
|---|-------------|------|---|------------------|---|--------------|------|
| 1 | atom        | 800  | 3 | labels0          | 2 | 0            | 0    |
| 2 | chainlink   | 1000 | 3 | labels0          | 2 | 0            | 0    |
| 3 | engytime    | 4096 | 2 | labels0          | 2 | 0            | 0    |
|   |             |      |   | labels1          | 2 | 0            | 0    |
| 4 | hepta       | 212  | 3 | labels0          | 7 | 0            | 0.01 |
| 5 | lsun        | 400  | 2 | labels0          | 3 | 0            | 0.25 |
| 6 | target      | 770  | 2 | labels0          | 6 | 0            | 0.79 |
|   |             |      |   | labels1          | 2 | 12           | 0.04 |
| 7 | tetra       | 400  | 3 | labels0          | 4 | 0            | 0    |
| 8 | twodiamonds | 800  | 2 | labels0          | 2 | 0            | 0    |
| 9 | wingnut     | 1016 | 2 | labels0          | 2 | 0            | 0    |

## graves

10 synthetic datasets discussed by D. Graves and W. Pedrycz in [29].

Each dataset consists of 200–1050 observations in 2 dimensions. Originally, they came with no reference labels.

| #  | dataset         | n    | d | reference labels | k | noise points | g    |
|----|-----------------|------|---|------------------|---|--------------|------|
| 1  | dense           | 200  | 2 | labels0          | 2 | 0            | 0    |
| 2  | fuzzyx          | 1000 | 2 | labels0          | 5 | 0            | 0.06 |
|    |                 |      |   | labels1          | 2 | 138          | 0.01 |
|    |                 |      |   | labels2          | 4 | 126          | 0.03 |
|    |                 |      |   | labels3          | 2 | 135          | 0.01 |
|    |                 |      |   | labels4          | 2 | 130          | 0.03 |
| 3  | line            | 250  | 2 | labels0          | 2 | 0            | 0.6  |
| 4  | parabolic       | 1000 | 2 | labels0          | 2 | 0            | 0.02 |
|    |                 |      |   | labels1          | 4 | 0            | 0.04 |
| 5  | ring            | 1000 | 2 | labels0          | 2 | 0            | 0    |
| 6  | ring_noisy      | 1050 | 2 | labels0          | 2 | 43           | 0    |
| 7  | ring_outliers   | 1030 | 2 | labels0          | 5 | 0            | 0.71 |
|    |                 |      |   | labels1          | 2 | 30           | 0    |
| 8  | zigzag          | 250  | 2 | labels0          | 3 | 0            | 0.4  |
|    |                 |      |   | labels1          | 5 | 0            | 0.04 |
| 9  | zigzag_noisy    | 300  | 2 | labels0          | 3 | 38           | 0.41 |
|    |                 |      |   | labels1          | 5 | 38           | 0.01 |
| 10 | zigzag_outliers | 280  | 2 | labels0          | 3 | 30           | 0.4  |
|    |                 |      |   | labels1          | 5 | 30           | 0.04 |

## other

Datasets from multiple sources:

• chameleon_t4_8k, chameleon_t5_8k, chameleon_t7_10k, chameleon_t8_8k – datasets supposedly related to the CHAMELEON algorithm by G. Karypis et al. [34].

In fact, only two of the above datasets (together with some others) are studied in [34]: chameleon_t7_10k is referred to as DS3, whilst chameleon_t8_8k is nicknamed DS4. The DS2 set mentioned therein looks like a noisier version of fcps/twodiamonds.

• hdbscan – a dataset used for demonstrating the outputs of the hdbscan package for Python [39];

• iris, iris5 – “the” (see [3] for discussion) famous Iris [15] dataset and its imbalanced version considered in [25];

• square – a dataset of unknown/unconfirmed origin (🚧 help needed 🚧).

| # | dataset          | n     | d | reference labels | k | noise points | g    |
|---|------------------|-------|---|------------------|---|--------------|------|
| 1 | chameleon_t4_8k  | 8000  | 2 | labels0          | 6 | 761          | 0.25 |
| 2 | chameleon_t5_8k  | 8000  | 2 | labels0          | 6 | 1187         | 0.03 |
| 3 | chameleon_t7_10k | 10000 | 2 | labels0          | 9 | 926          | 0.47 |
| 4 | chameleon_t8_8k  | 8000  | 2 | labels0          | 8 | 346          | 0.37 |
| 5 | hdbscan          | 2309  | 2 | labels0          | 6 | 510          | 0.18 |
| 6 | iris             | 150   | 4 | labels0          | 3 | 0            | 0    |
| 7 | iris5            | 105   | 4 | labels0          | 3 | 0            | 0.43 |
| 8 | square           | 1000  | 2 | labels0          | 2 | 0            | 0    |

## uci

A selection of 8 high-dimensional datasets available at the UCI (University of California, Irvine) Machine Learning Repository [12]. Some of them were considered for benchmark purposes in, amongst others, [29]. They are also listed in the sipu battery. However, they were originally designed for testing classification, not clustering, algorithms.

| # | dataset    | n    | d  | reference labels | k  | noise points | g    |
|---|------------|------|----|------------------|----|--------------|------|
| 1 | ecoli      | 336  | 7  | labels0          | 8  | 0            | 0.65 |
| 2 | glass      | 214  | 9  | labels0          | 6  | 0            | 0.48 |
| 3 | ionosphere | 351  | 34 | labels0          | 2  | 0            | 0.28 |
| 4 | sonar      | 208  | 60 | labels0          | 2  | 0            | 0.07 |
| 5 | statlog    | 2310 | 19 | labels0          | 7  | 0            | 0    |
| 6 | wdbc       | 569  | 30 | labels0          | 2  | 0            | 0.25 |
| 7 | wine       | 178  | 13 | labels0          | 3  | 0            | 0.13 |
| 8 | yeast      | 1484 | 8  | labels0          | 10 | 0            | 0.63 |

## mnist

This battery features two large, high-dimensional datasets:

1. MNIST – a database of handwritten digits (a preprocessed remix of NIST data made by Y. LeCun, C. Cortes, and C.J.C. Burges),

2. Fashion-MNIST – a similarly-structured dataset of Zalando articles compiled by H. Xiao, K. Rasul, and R. Vollgraf; see [59].

Both datasets consist of 70,000 flattened 28×28 grayscale images (training and test samples combined).
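
Each row of the 70,000×784 data matrix can thus be reshaped back into a 28×28 image. A minimal sketch, with random bytes standing in for the real matrix:

```python
import numpy as np

# a stand-in for the real 70000x784 matrix (random bytes, not real MNIST)
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(70000, 784), dtype=np.uint8)

img = X[0].reshape(28, 28)  # one flattened row back to a 28x28 image
print(img.shape)  # (28, 28)
```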

| # | dataset | n     | d   | reference labels | k  | noise points | g    |
|---|---------|-------|-----|------------------|----|--------------|------|
| 1 | digits  | 70000 | 784 | labels0          | 10 | 0            | 0.03 |
| 2 | fashion | 70000 | 784 | labels0          | 10 | 0            | 0    |

## g2mg

Each dataset consists of 2,048 observations from two equisized Gaussian clusters in $$d=1, 2, \dots, 128$$ dimensions (the components are sampled independently from a normal distribution).

They can be considered a modified version of Gaussian G2-sets from https://cs.joensuu.fi/sipu/datasets/, but with variances dependent on datasets’ dimensionalities, i.e., $$s\sqrt{d/2}$$ for different s. This makes these new problems more difficult than their original counterparts. The 1-dimensional datasets as well as those of very low and very high variances should probably be discarded.

It is well-known that such a data distribution (multivariate normal with independent components) is subject to the so-called curse of dimensionality, leading to some weird behaviour for high d; see, e.g., the Gaussian Annulus Theorem mentioned in [5].
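
The concentration phenomenon is easy to observe empirically: the norms of standard normal vectors in $$\mathbb{R}^d$$ cluster ever more tightly (in relative terms) around $$\sqrt{d}$$ as d grows. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

def relative_spread(d, n=10000):
    """Std/mean of the Euclidean norms of n standard normal vectors in R^d."""
    norms = np.linalg.norm(rng.standard_normal((n, d)), axis=1)
    return norms.std() / norms.mean()

# the relative spread shrinks roughly like 1/sqrt(2d)
print(relative_spread(2))    # ~0.5
print(relative_spread(128))  # ~0.06
```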

We suggest that these datasets should be studied separately from other batteries, because they are too plentiful. Also, parametric algorithms that specialise in detecting Gaussian blobs (k-means, expectation-maximisation (EM) for Gaussian mixtures) will naturally perform better thereon than the non-parametric approaches.
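
A dataset of this kind can be emulated as follows; note that this is only a sketch of the construction described above, not the suite's actual generator (in particular, the cluster centres chosen below are illustrative assumptions):

```python
import numpy as np

def g2mg_like(d, s, n=2048, seed=0):
    """Two equisized Gaussian clusters in R^d with standard deviation
    s * sqrt(d / 2), i.e., spread corrected for the dimensionality.

    The centres (0 and 100 in every coordinate) are illustrative;
    the actual generator of the suite may use different ones.
    """
    rng = np.random.default_rng(seed)
    sd = s * np.sqrt(d / 2.0)
    half = n // 2
    a = rng.normal(loc=0.0, scale=sd, size=(half, d))
    b = rng.normal(loc=100.0, scale=sd, size=(half, d))
    X = np.vstack([a, b])
    y = np.repeat([1, 2], half)  # reference labels
    return X, y

X, y = g2mg_like(d=8, s=10)
print(X.shape)               # (2048, 8)
print(np.bincount(y)[1:])    # [1024 1024], i.e., equisized clusters
```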

| #  | dataset     | n    | d   | reference labels | k | noise points | g    |
|----|-------------|------|-----|------------------|---|--------------|------|
| 1  | g2mg_1_10   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 2  | g2mg_1_20   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 3  | g2mg_1_30   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 4  | g2mg_1_40   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 5  | g2mg_1_50   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 6  | g2mg_1_60   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 7  | g2mg_1_70   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 8  | g2mg_1_80   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 9  | g2mg_1_90   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 10 | g2mg_2_10   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 11 | g2mg_2_20   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 12 | g2mg_2_30   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 13 | g2mg_2_40   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 14 | g2mg_2_50   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 15 | g2mg_2_60   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 16 | g2mg_2_70   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 17 | g2mg_2_80   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 18 | g2mg_2_90   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 19 | g2mg_4_10   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 20 | g2mg_4_20   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 21 | g2mg_4_30   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 22 | g2mg_4_40   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 23 | g2mg_4_50   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 24 | g2mg_4_60   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 25 | g2mg_4_70   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 26 | g2mg_4_80   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 27 | g2mg_4_90   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 28 | g2mg_8_10   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 29 | g2mg_8_20   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 30 | g2mg_8_30   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 31 | g2mg_8_40   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 32 | g2mg_8_50   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 33 | g2mg_8_60   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 34 | g2mg_8_70   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 35 | g2mg_8_80   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.03 |
| 36 | g2mg_8_90   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.03 |
| 37 | g2mg_16_10  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 38 | g2mg_16_20  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 39 | g2mg_16_30  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 40 | g2mg_16_40  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 41 | g2mg_16_50  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 42 | g2mg_16_60  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 43 | g2mg_16_70  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 44 | g2mg_16_80  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 45 | g2mg_16_90  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 46 | g2mg_32_10  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 47 | g2mg_32_20  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 48 | g2mg_32_30  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 49 | g2mg_32_40  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 50 | g2mg_32_50  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 51 | g2mg_32_60  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 52 | g2mg_32_70  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 53 | g2mg_32_80  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 54 | g2mg_32_90  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 55 | g2mg_64_10  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 56 | g2mg_64_20  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 57 | g2mg_64_30  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 58 | g2mg_64_40  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 59 | g2mg_64_50  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 60 | g2mg_64_60  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 61 | g2mg_64_70  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 62 | g2mg_64_80  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 63 | g2mg_64_90  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 64 | g2mg_128_10 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 65 | g2mg_128_20 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 66 | g2mg_128_30 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 67 | g2mg_128_40 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 68 | g2mg_128_50 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 69 | g2mg_128_60 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 70 | g2mg_128_70 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 71 | g2mg_128_80 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 72 | g2mg_128_90 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |

## h2mg

Two Gaussian-like hubs of equal sizes, with spread dependent on the datasets’ dimensionalities. Each dataset consists of 2,048 observations in $$d=1, 2, \dots, 128$$ dimensions. Each point is sampled from a sphere centred at its own cluster’s centre, with a radius that follows a Gaussian distribution with a predefined scaling parameter.

Just like in the case of g2mg, we suggest that these datasets should be studied separately from other batteries.
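
The sampling scheme described above can be sketched as follows; the cluster centres and the scaling parameter below are illustrative assumptions, not the suite's actual settings:

```python
import numpy as np

def h2mg_like(d, scale, n=2048, seed=0):
    """Two 'hubs' of equal sizes: each point lies on a random direction
    from its cluster's centre, at a distance drawn from a Gaussian
    distribution (absolute value taken so that the radius is nonnegative).
    Centres at 0 and 100 in every coordinate are illustrative choices."""
    rng = np.random.default_rng(seed)
    centres = np.vstack([np.zeros(d), np.full(d, 100.0)])
    half = n // 2
    y = np.repeat([1, 2], half)  # reference labels
    dirs = rng.standard_normal((n, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    radii = np.abs(rng.normal(scale=scale, size=(n, 1)))
    X = centres[y - 1] + radii * dirs
    return X, y

X, y = h2mg_like(d=16, scale=25.0)
print(X.shape)  # (2048, 16)
```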

| #  | dataset     | n    | d   | reference labels | k | noise points | g    |
|----|-------------|------|-----|------------------|---|--------------|------|
| 1  | h2mg_1_10   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 2  | h2mg_1_20   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 3  | h2mg_1_30   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.09 |
| 4  | h2mg_1_40   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.22 |
| 5  | h2mg_1_50   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.32 |
| 6  | h2mg_1_60   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.41 |
| 7  | h2mg_1_70   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.47 |
| 8  | h2mg_1_80   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.53 |
| 9  | h2mg_1_90   | 2048 | 1   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.57 |
| 10 | h2mg_2_10   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 11 | h2mg_2_20   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 12 | h2mg_2_30   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 13 | h2mg_2_40   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 14 | h2mg_2_50   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 15 | h2mg_2_60   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 16 | h2mg_2_70   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 17 | h2mg_2_80   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 18 | h2mg_2_90   | 2048 | 2   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 19 | h2mg_4_10   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 20 | h2mg_4_20   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 21 | h2mg_4_30   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 22 | h2mg_4_40   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 23 | h2mg_4_50   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 24 | h2mg_4_60   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 25 | h2mg_4_70   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 26 | h2mg_4_80   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 27 | h2mg_4_90   | 2048 | 4   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 28 | h2mg_8_10   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 29 | h2mg_8_20   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 30 | h2mg_8_30   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 31 | h2mg_8_40   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 32 | h2mg_8_50   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 33 | h2mg_8_60   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 34 | h2mg_8_70   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 35 | h2mg_8_80   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 36 | h2mg_8_90   | 2048 | 8   | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 37 | h2mg_16_10  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 38 | h2mg_16_20  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 39 | h2mg_16_30  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 40 | h2mg_16_40  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 41 | h2mg_16_50  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 42 | h2mg_16_60  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 43 | h2mg_16_70  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 44 | h2mg_16_80  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 45 | h2mg_16_90  | 2048 | 16  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 46 | h2mg_32_10  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 47 | h2mg_32_20  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 48 | h2mg_32_30  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 49 | h2mg_32_40  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 50 | h2mg_32_50  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.02 |
| 51 | h2mg_32_60  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 52 | h2mg_32_70  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 53 | h2mg_32_80  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 54 | h2mg_32_90  | 2048 | 32  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 55 | h2mg_64_10  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 56 | h2mg_64_20  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 57 | h2mg_64_30  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 58 | h2mg_64_40  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 59 | h2mg_64_50  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 60 | h2mg_64_60  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 61 | h2mg_64_70  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 62 | h2mg_64_80  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 63 | h2mg_64_90  | 2048 | 64  | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 64 | h2mg_128_10 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 65 | h2mg_128_20 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 66 | h2mg_128_30 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0    |
| 67 | h2mg_128_40 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 68 | h2mg_128_50 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 69 | h2mg_128_60 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 70 | h2mg_128_70 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 71 | h2mg_128_80 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |
| 72 | h2mg_128_90 | 2048 | 128 | labels0          | 2 | 0            | 0    |
|    |             |      |     | labels1          | 2 | 0            | 0.01 |