References

How to Cite

When using this framework in research publications, please cite [21] as specified below. Thank you.

Additionally, please mention the benchmark suite project [28] as well as the literature references listed in the description files corresponding to each dataset studied.
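For convenience, a BibTeX entry for [21] might look as follows; all field values are taken from the bibliography below, and the entry key is arbitrary:

```bibtex
@article{clustering-benchmarks,
    author  = {Marek Gagolewski},
    title   = {A framework for benchmarking clustering algorithms},
    journal = {SoftwareX},
    year    = {2022},
    volume  = {20},
    pages   = {101270},
    doi     = {10.1016/j.softx.2022.101270},
    url     = {https://clustering-benchmarks.gagolewski.com/}
}
```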

Bibliography

[1]

Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M., and Perona, I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1):243–256. DOI: https://doi.org/10.1016/j.patcog.2012.07.021.

[2]

Ball, G.H. and Hall, D.J. (1965). ISODATA: A novel method of data analysis and pattern classification. Technical Report AD699616, Stanford Research Institute.

[3]

Bezdek, J.C., Keller, J.M., Krishnapuram, R., Kuncheva, L.I., and Pal, N.R. (1999). Will the real iris data please stand up? IEEE Transactions on Fuzzy Systems, 7(3):368–369. DOI: 10.1109/91.771092.

[4]

Bezdek, J.C. and Pal, N.R. (1998). Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 28(3):301–315. DOI: 10.1109/3477.678624.

[5]

Blum, A., Hopcroft, J., and Kannan, R. (2020). Foundations of Data Science. Cambridge University Press. URL: https://www.cs.cornell.edu/jeh/book.pdf.

[6]

Buitinck, L. and others. (2013). API design for machine learning software: Experiences from the scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122.

[7]

Caliński, T. and Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3(1):1–27. DOI: 10.1080/03610927408827101.

[8]

Chang, H. and Yeung, D.Y. (2008). Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203.

[9]

Crouse, D.F. (2016). On implementing 2D rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696. DOI: 10.1109/TAES.2016.140952.

[10]

Dasgupta, S. and Ng, V. (2009). Single data, multiple clusterings. In: Proc. NIPS Workshop Clustering: Science or Art? Towards Principled Approaches. URL: https://clusteringtheory.org.

[11]

Davies, D.L. and Bouldin, D.W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI–1(2):224–227. DOI: 10.1109/TPAMI.1979.4766909.

[12]

Dua, D. and Graff, C. (2022). UCI Machine Learning Repository. URL: http://archive.ics.uci.edu/ml.

[13]

Dunn, J.C. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32–57. DOI: 10.1080/01969727308546046.

[14]

Edwards, A.W.F. and Cavalli-Sforza, L.L. (1965). A method for cluster analysis. Biometrics, 21(2):362–375. DOI: 10.2307/2528096.

[15]

Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188.

[16]

Fowlkes, E.B. and Mallows, C.L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569.

[17]

Fränti, P. and Sieranoja, S. (2018). K-means properties on six clustering benchmark datasets. Applied Intelligence, 48(12):4743–4759. DOI: 10.1007/s10489-018-1238-7.

[18]

Fränti, P. and Virmajoki, O. (2006). Iterative shrinking method for clustering problems. Pattern Recognition, 39(5):761–765.

[19]

Fu, L. and Medico, E. (2007). FLAME: A novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinformatics, 8:3.

[20]

Gagolewski, M. (2021). genieclust: Fast and robust hierarchical clustering. SoftwareX, 15:100722. URL: https://genieclust.gagolewski.com/, DOI: 10.1016/j.softx.2021.100722.

[21]

Gagolewski, M. (2022). A framework for benchmarking clustering algorithms. SoftwareX, 20:101270. URL: https://clustering-benchmarks.gagolewski.com/, DOI: 10.1016/j.softx.2022.101270.

[22]

Gagolewski, M. (2022). Minimalist Data Wrangling with Python. Zenodo, Melbourne. ISBN 978-0-6455719-1-2. URL: https://datawranglingpy.gagolewski.com/, DOI: 10.5281/zenodo.6451068.

[23]

Gagolewski, M. (2023). Deep R Programming. Zenodo, Melbourne. ISBN 978-0-6455719-2-9. URL: https://deepr.gagolewski.com/, DOI: 10.5281/zenodo.7490464.

[24]

Gagolewski, M. (2023). Normalised clustering accuracy: An asymmetric external cluster validity measure. under review (preprint). URL: https://arxiv.org/pdf/2209.02935.pdf, DOI: 10.48550/arXiv.2209.02935.

[25]

Gagolewski, M., Bartoszuk, M., and Cena, A. (2016). Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm. Information Sciences, 363:8–23. URL: https://arxiv.org/pdf/2209.05757, DOI: 10.1016/j.ins.2016.05.003.

[26]

Gagolewski, M., Bartoszuk, M., and Cena, A. (2021). Are cluster validity measures (in)valid? Information Sciences, 581:620–636. URL: https://arxiv.org/pdf/2208.01261, DOI: 10.1016/j.ins.2021.10.004.

[27]

Gagolewski, M., Cena, A., Bartoszuk, M., and Brzozowski, L. (2023). Clustering with minimum spanning trees: How good can it be? under review (preprint). DOI: 10.48550/arXiv.2303.05679.

[28]

Gagolewski, M. and others. (2022). A benchmark suite for clustering algorithms: Version 1.1.0. URL: https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0, DOI: 10.5281/zenodo.7088171.

[29]

Gionis, A., Mannila, H., and Tsaparas, P. (2007). Clustering aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1):4.

[30]

Graves, D. and Pedrycz, W. (2010). Kernel-based fuzzy clustering and fuzzy clustering: A comparative experimental study. Fuzzy Sets and Systems, 161:522–543. DOI: 10.1016/j.fss.2009.10.021.

[31]

Halkidi, M., Batistakis, Y., and Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2–3):107–145. DOI: 10.1023/A:1012801612483.

[32]

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218. DOI: 10.1007/BF01908075.

[33]

Jain, A. and Law, M. (2005). Data clustering: A user's dilemma. Lecture Notes in Computer Science, 3776:1–10.

[34]

Jamil, M., Yang, X.-S., and Zepernick, H.-J. (2013). Test functions for global optimization: A comprehensive survey. In: Swarm Intelligence and Bio-Inspired Computation, pp. 193–222. DOI: 10.1016/B978-0-12-405163-8.00008-9.

[35]

Karypis, G., Han, E.H., and Kumar, V. (1999). CHAMELEON: Hierarchical clustering using dynamic modeling. Computer, 32(8):68–75. DOI: 10.1109/2.781637.

[36]

Kvålseth, T.O. (1987). Entropy and correlation: Some comments. IEEE Transactions on Systems, Man, and Cybernetics, 17(3):517–519. DOI: 10.1109/TSMC.1987.4309069.

[37]

Kärkkäinen, I. and Fränti, P. (2002). Dynamic local search algorithm for the clustering problem. In: Proc. 16th International Conference on Pattern Recognition (ICPR'02), volume 2, pp. 240–243. IEEE.

[38]

Lloyd, S.P. (1957). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:128–137. Originally a 1957 Bell Telephone Laboratories Research Report; republished in 1982. DOI: 10.1109/TIT.1982.1056489.

[39]

Maulik, U. and Bandyopadhyay, S. (2002). Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12):1650–1654. DOI: 10.1109/TPAMI.2002.1114856.

[40]

McInnes, L., Healy, J., and Astels, S. (2017). hdbscan: Hierarchical density based clustering. The Journal of Open Source Software, 2(11):205. DOI: 10.21105/joss.00205.

[41]

Meilă, M. and Heckerman, D. (2001). An experimental comparison of model-based clustering methods. Machine Learning, 42:9–29. DOI: 10.1023/A:1007648401407.

[42]

Milligan, G.W. and Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179.

[43]

Müller, A.C., Nowozin, S., and Lampert, C.H. (2012). Information theoretic clustering using minimum spanning trees. In: Proc. German Conference on Pattern Recognition. URL: https://github.com/amueller/information-theoretic-mst.

[44]

Müllner, D. (2013). fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, 53(9):1–18. DOI: 10.18637/jss.v053.i09.

[45]

Pedregosa, F. and others. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(85):2825–2830. URL: http://jmlr.org/papers/v12/pedregosa11a.html.

[46]

Rand, W.M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850. DOI: 10.2307/2284239.

[47]

Rezaei, M. and Fränti, P. (2016). Set matching measures for external cluster validity. IEEE Transactions on Knowledge and Data Engineering, 28(8):2173–2186. DOI: 10.1109/TKDE.2016.2551240.

[48]

Rousseeuw, P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65. DOI: 10.1016/0377-0427(87)90125-7.

[49]

Sieranoja, S. and Fränti, P. (2019). Fast and general density peaks clustering. Pattern Recognition Letters, 128:551–558. DOI: 10.1016/j.patrec.2019.10.019.

[50]

Steinley, D. (2004). Properties of the Hubert–Arabie adjusted Rand index. Psychological Methods, 9(3):386–396. DOI: 10.1037/1082-989X.9.3.386.

[51]

Thrun, M.C. and Stier, Q. (2021). Fundamental clustering algorithms suite. SoftwareX, 13:100642. DOI: 10.1016/j.softx.2020.100642.

[52]

Thrun, M.C. and Ultsch, A. (2020). Clustering benchmark datasets exploiting the fundamental clustering problems. Data in Brief, 30:105501. DOI: 10.1016/j.dib.2020.105501.

[53]

Ullmann, T., Beer, A., Hünemörder, M., Seidl, T., and Boulesteix, A.-L. (2022). Over-optimistic evaluation and reporting of novel cluster algorithms: An illustrative study. Advances in Data Analysis and Classification. DOI: 10.1007/s11634-022-00496-5.

[54]

Ullmann, T., Hennig, C., and Boulesteix, A.-L. (2022). Validation of cluster analysis results on validation data: A systematic framework. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12(3):e1444. DOI: 10.1002/widm.1444.

[55]

Ultsch, A. (2005). Clustering with SOM: U*C. In: Workshop on Self-Organizing Maps, pp. 75–82.

[56]

Veenman, C.J., Reinders, M.J.T., and Backer, E. (2002). A maximum variance cluster algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1273–1280.

[57]

Vinh, N.X., Epps, J., and Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(95):2837–2854. URL: http://jmlr.org/papers/v11/vinh10a.html.

[58]

von Luxburg, U., Williamson, R.C., and Guyon, I. (2012). Clustering: science or art? In: Guyon, I. and others, editors, Proc. ICML Workshop on Unsupervised and Transfer Learning, volume 27 of Proc. Machine Learning Research, pp. 65–79.

[59]

Ward Jr., J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244. DOI: 10.1080/01621459.1963.10500845.

[60]

Weise, T. and others. (2014). Benchmarking optimization algorithms: An open source framework for the traveling salesman problem. IEEE Computational Intelligence Magazine, 9(3):40–52. DOI: 10.1109/MCI.2014.2326101.

[61]

Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. URL: https://arxiv.org/pdf/1708.07747.pdf, DOI: 10.48550/arXiv.1708.07747.

[62]

Xu, Q., Zhang, Q., Liu, J., and Luo, B. (2020). Efficient synthetical clustering validity indexes for hierarchical clustering. Expert Systems with Applications, 151:113367. DOI: 10.1016/j.eswa.2020.113367.

[63]

Zahn, C.T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20(1):68–86.

[64]

Zhang, T., Ramakrishnan, R., and Livny, M. (1997). BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1:141–182. DOI: 10.1023/A:1009783824328.

[65]

van Mechelen, I., Boulesteix, A.-L., Dangl, R., and others. (2023). A white paper on good research practices in benchmarking: The case of cluster analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. DOI: 10.1002/widm.1511.