References

How to Cite

When using this framework in research publications, please cite [Gag22a] as specified below. Thank you.

Additionally, please mention the benchmark suite project [G+22] as well as the literature references listed in the description file of each dataset studied.
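For convenience, the [Gag22a] reference can be expressed in BibTeX roughly as follows; the entry type and key are illustrative (adapt them to your bibliography style), and the field values are taken from the bibliography entry below:

% illustrative entry type and key; data copied from the Gag22a entry below
@misc{Gag22a,
    author = {Gagolewski, M.},
    title  = {A framework for benchmarking clustering algorithms},
    year   = {2022},
    note   = {under review (preprint)},
    url    = {https://clustering-benchmarks.gagolewski.com},
    doi    = {10.48550/arXiv.2209.09493}
}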

Bibliography

AGM+13

Arbelaitz O., Gurrutxaga I., Muguerza J., Pérez J.M., Perona I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1):243–256. DOI: 10.1016/j.patcog.2012.07.021.

BH65

Ball G.H., Hall D.J. (1965). ISODATA: A novel method of data analysis and pattern classification. Technical Report AD699616, Stanford Research Institute.

BKK+99

Bezdek J.C., Keller J.M., Krishnapuram R., Kuncheva L.I., Pal N.R. (1999). Will the real iris data please stand up? IEEE Transactions on Fuzzy Systems, 7(3):368–369. DOI: 10.1109/91.771092.

BP98

Bezdek J.C., Pal N.R. (1998). Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 28(3):301–315. DOI: 10.1109/3477.678624.

BHK20

Blum A., Hopcroft J., Kannan R. (2020). Foundations of Data Science. Cambridge University Press. URL: https://www.cs.cornell.edu/jeh/book.pdf.

B+13

Buitinck L., et al. (2013). API design for machine learning software: Experiences from the scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122.

CH74

Caliński T., Harabasz J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3(1):1–27. DOI: 10.1080/03610927408827101.

CY08

Chang H., Yeung D.Y. (2008). Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203.

Cro16

Crouse D.F. (2016). On implementing 2D rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696. DOI: 10.1109/TAES.2016.140952.

DN09

Dasgupta S., Ng V. (2009). Single data, multiple clusterings. In: Proc. NIPS Workshop Clustering: Science or Art? Towards Principled Approaches. URL: https://clusteringtheory.org.

DB79

Davies D.L., Bouldin D.W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI–1(2):224–227. DOI: 10.1109/TPAMI.1979.4766909.

DG22

Dua D., Graff C. (2022). UCI Machine Learning Repository. URL: http://archive.ics.uci.edu/ml.

Dun74

Dunn J.C. (1974). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32–57. DOI: 10.1080/01969727308546046.

ECS65

Edwards A.W.F., Cavalli-Sforza L.L. (1965). A method for cluster analysis. Biometrics, 21(2):362–375. DOI: 10.2307/2528096.

Fis36

Fisher R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188.

FM83

Fowlkes E.B., Mallows C.L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569.

FS18

Fränti P., Sieranoja S. (2018). K-means properties on six clustering benchmark datasets. Applied Intelligence, 48(12):4743–4759. DOI: 10.1007/s10489-018-1238-7.

FV06

Fränti P., Virmajoki O. (2006). Iterative shrinking method for clustering problems. Pattern Recognition, 39(5):761–765.

FM07

Fu L., Medico E. (2007). FLAME: A novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinformatics, 8:3.

Gag21

Gagolewski M. (2021). genieclust: Fast and robust hierarchical clustering. SoftwareX, 15:100722. DOI: 10.1016/j.softx.2021.100722.

Gag22a

Gagolewski M. (2022). A framework for benchmarking clustering algorithms. Under review (preprint). URL: https://clustering-benchmarks.gagolewski.com, DOI: 10.48550/arXiv.2209.09493.

Gag22b

Gagolewski M. (2022). Adjusted asymmetric accuracy: A well-behaving external cluster validity measure. Under review (preprint). URL: https://arxiv.org/pdf/2209.02935.pdf, DOI: 10.48550/arXiv.2209.02935.

Gag22c

Gagolewski M. (2022). Minimalist Data Wrangling with Python. Zenodo, Melbourne. ISBN 978-0-6455719-1-2. URL: https://datawranglingpy.gagolewski.com/, DOI: 10.5281/zenodo.6451068.

GBC16

Gagolewski M., Bartoszuk M., Cena A. (2016). Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm. Information Sciences, 363:8–23. DOI: 10.1016/j.ins.2016.05.003.

GBC21

Gagolewski M., Bartoszuk M., Cena A. (2021). Are cluster validity measures (in)valid? Information Sciences, 581:620–636. DOI: 10.1016/j.ins.2021.10.004.

G+22

Gagolewski M., et al. (2022). A benchmark suite for clustering algorithms: version 1.1.0. URL: https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0, DOI: 10.5281/zenodo.7088171.

GMT07

Gionis A., Mannila H., Tsaparas P. (2007). Clustering aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1):4.

GP10

Graves D., Pedrycz W. (2010). Kernel-based fuzzy clustering and fuzzy clustering: A comparative experimental study. Fuzzy Sets and Systems, 161:522–543. DOI: 10.1016/j.fss.2009.10.021.

HBV01

Halkidi M., Batistakis Y., Vazirgiannis M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2–3):107–145. DOI: 10.1023/A:1012801612483.

HA85

Hubert L., Arabie P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218. DOI: 10.1007/BF01908075.

JL05

Jain A., Law M. (2005). Data clustering: A user's dilemma. Lecture Notes in Computer Science, 3776:1–10.

JYZ13

Jamil M., Yang X.-S., Zepernick H.-J. (2013). Test functions for global optimization: A comprehensive survey. In: Swarm Intelligence and Bio-Inspired Computation, pp. 193–222. DOI: 10.1016/B978-0-12-405163-8.00008-9.

KHK99

Karypis G., Han E.H., Kumar V. (1999). CHAMELEON: Hierarchical clustering using dynamic modeling. Computer, 32(8):68–75. DOI: 10.1109/2.781637.

Kva87

Kvalseth T.O. (1987). Entropy and correlation: Some comments. IEEE Transactions on Systems, Man, and Cybernetics, 17(3):517–519. DOI: 10.1109/TSMC.1987.4309069.

KF02

Kärkkäinen I., Fränti P. (2002). Dynamic local search algorithm for the clustering problem. In: Proc. 16th International Conference on Pattern Recognition (ICPR'02), volume 2, pp. 240–243. IEEE.

Llo57

Lloyd S.P. (1957). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137. Originally a 1957 Bell Telephone Laboratories Research Report; republished in 1982. DOI: 10.1109/TIT.1982.1056489.

MB02

Maulik U., Bandyopadhyay S. (2002). Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12):1650–1654. DOI: 10.1109/TPAMI.2002.1114856.

MHA17

McInnes L., Healy J., Astels S. (2017). hdbscan: Hierarchical density based clustering. The Journal of Open Source Software, 2(11):205. DOI: 10.21105/joss.00205.

MH01

Meilă M., Heckerman D. (2001). An experimental comparison of model-based clustering methods. Machine Learning, 42:9–29. DOI: 10.1023/A:1007648401407.

MC85

Milligan G.W., Cooper M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179.

MNL12

Müller A.C., Nowozin S., Lampert C.H. (2012). Information theoretic clustering using minimum spanning trees. In: Proc. German Conference on Pattern Recognition. URL: https://github.com/amueller/information-theoretic-mst.

Mul13

Müllner D. (2013). fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, 53(9):1–18. DOI: 10.18637/jss.v053.i09.

P+11

Pedregosa F., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(85):2825–2830. URL: http://jmlr.org/papers/v12/pedregosa11a.html.

Ran71

Rand W.M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850. DOI: 10.2307/2284239.

RF16

Rezaei M., Fränti P. (2016). Set matching measures for external cluster validity. IEEE Transactions on Knowledge and Data Engineering, 28(8):2173–2186. DOI: 10.1109/TKDE.2016.2551240.

Rou87

Rousseeuw P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65. DOI: 10.1016/0377-0427(87)90125-7.

SF19

Sieranoja S., Fränti P. (2019). Fast and general density peaks clustering. Pattern Recognition Letters, 128:551–558. DOI: 10.1016/j.patrec.2019.10.019.

Ste04

Steinley D. (2004). Properties of the Hubert–Arabie adjusted Rand index. Psychological Methods, 9(3):386–396. DOI: 10.1037/1082-989X.9.3.386.

TU20

Thrun M.C., Ultsch A. (2020). Clustering benchmark datasets exploiting the fundamental clustering problems. Data in Brief, 30:105501. DOI: 10.1016/j.dib.2020.105501.

Ult05

Ultsch A. (2005). Clustering with SOM: U*C. In: Workshop on Self-Organizing Maps, pp. 75–82.

VRB02

Veenman C.J., Reinders M.J.T., Backer E. (2002). A maximum variance cluster algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1273–1280.

VEB10

Vinh N.X., Epps J., Bailey J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(95):2837–2854. URL: http://jmlr.org/papers/v11/vinh10a.html.

vLWG12

von Luxburg U., Williamson R.C., Guyon I. (2012). Clustering: science or art? In: Guyon I., et al., editors, Proc. ICML Workshop on Unsupervised and Transfer Learning, volume 27 of Proc. Machine Learning Research, pp. 65–79.

War63

Ward Jr. J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244. DOI: 10.1080/01621459.1963.10500845.

W+14

Weise T., et al. (2014). Benchmarking optimization algorithms: An open source framework for the traveling salesman problem. IEEE Computational Intelligence Magazine, 9(3):40–52. DOI: 10.1109/MCI.2014.2326101.

XRV17

Xiao H., Rasul K., Vollgraf R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. URL: https://arxiv.org/pdf/1708.07747.pdf, DOI: 10.48550/arXiv.1708.07747.

XZLL20

Xu Q., Zhang Q., Liu J., Luo B. (2020). Efficient synthetical clustering validity indexes for hierarchical clustering. Expert Systems with Applications, 151:113367. DOI: 10.1016/j.eswa.2020.113367.

Zah71

Zahn C.T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20(1):68–86.

ZRL97

Zhang T., Ramakrishnan R., Livny M. (1997). BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1:141–182. DOI: 10.1023/A:1009783824328.