References
How to Cite
When using this framework in research publications, please cite [21] as specified below. Thank you.
Additionally, please cite the benchmark suite project [27] as well as the literature references listed in the description file of each dataset studied.
Bibliography
- 1
Arbelaitz O., Gurrutxaga I., Muguerza J., Pérez J.M., Perona I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1):243–256. DOI: 10.1016/j.patcog.2012.07.021.
- 2
Ball G.H., Hall D.J. (1965). ISODATA: A novel method of data analysis and pattern classification. Technical Report AD699616, Stanford Research Institute.
- 3
Bezdek J.C., Keller J.M., Krishnapuram R., Kuncheva L.I., Pal N.R. (1999). Will the real iris data please stand up? IEEE Transactions on Fuzzy Systems, 7(3):368–369. DOI: 10.1109/91.771092.
- 4
Bezdek J.C., Pal N.R. (1998). Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 28(3):301–315. DOI: 10.1109/3477.678624.
- 5
Blum A., Hopcroft J., Kannan R. (2020). Foundations of Data Science. Cambridge University Press. URL: https://www.cs.cornell.edu/jeh/book.pdf.
- 6
Buitinck L., et al. (2013). API design for machine learning software: Experiences from the scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122.
- 7
Caliński T., Harabasz J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3(1):1–27. DOI: 10.1080/03610927408827101.
- 8
Chang H., Yeung D.Y. (2008). Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203.
- 9
Crouse D.F. (2016). On implementing 2D rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696. DOI: 10.1109/TAES.2016.140952.
- 10
Dasgupta S., Ng V. (2009). Single data, multiple clusterings. In: Proc. NIPS Workshop Clustering: Science or Art? Towards Principled Approaches. URL: https://clusteringtheory.org.
- 11
Davies D.L., Bouldin D.W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):224–227. DOI: 10.1109/TPAMI.1979.4766909.
- 12
Dua D., Graff C. (2022). UCI Machine Learning Repository. URL: http://archive.ics.uci.edu/ml.
- 13
Dunn J.C. (1974). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32–57. DOI: 10.1080/01969727308546046.
- 14
Edwards A.W.F., Cavalli-Sforza L.L. (1965). A method for cluster analysis. Biometrics, 21(2):362–375. DOI: 10.2307/2528096.
- 15
Fisher R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188.
- 16
Fowlkes E.B., Mallows C.L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569.
- 17
Fränti P., Sieranoja S. (2018). K-means properties on six clustering benchmark datasets. Applied Intelligence, 48(12):4743–4759. DOI: 10.1007/s10489-018-1238-7.
- 18
Fränti P., Virmajoki O. (2006). Iterative shrinking method for clustering problems. Pattern Recognition, 39(5):761–765.
- 19
Fu L., Medico E. (2007). FLAME: A novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinformatics, 8:3.
- 20
Gagolewski M. (2021). genieclust: Fast and robust hierarchical clustering. SoftwareX, 15:100722. DOI: 10.1016/j.softx.2021.100722.
- 21
Gagolewski M. (2022). A framework for benchmarking clustering algorithms. SoftwareX, 20:101270. URL: https://clustering-benchmarks.gagolewski.com, DOI: 10.1016/j.softx.2022.101270.
- 22
Gagolewski M. (2022). Adjusted asymmetric accuracy: A well-behaving external cluster validity measure. Under review (preprint). URL: https://arxiv.org/pdf/2209.02935.pdf, DOI: 10.48550/arXiv.2209.02935.
- 23
Gagolewski M. (2022). Minimalist Data Wrangling with Python. Zenodo, Melbourne. ISBN 978-0-6455719-1-2. URL: https://datawranglingpy.gagolewski.com/, DOI: 10.5281/zenodo.6451068.
- 24
Gagolewski M. (2023). Deep R Programming. Zenodo, Melbourne. ISBN 978-0-6455719-2-9 (reserved). Early draft. URL: https://deepr.gagolewski.com/, DOI: 10.5281/zenodo.7490464.
- 25
Gagolewski M., Bartoszuk M., Cena A. (2016). Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm. Information Sciences, 363:8–23. DOI: 10.1016/j.ins.2016.05.003.
- 26
Gagolewski M., Bartoszuk M., Cena A. (2021). Are cluster validity measures (in)valid? Information Sciences, 581:620–636. DOI: 10.1016/j.ins.2021.10.004.
- 27
Gagolewski M., et al. (2022). A benchmark suite for clustering algorithms: version 1.1.0. URL: https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0, DOI: 10.5281/zenodo.7088171.
- 28
Gionis A., Mannila H., Tsaparas P. (2007). Clustering aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1):4.
- 29
Graves D., Pedrycz W. (2010). Kernel-based fuzzy clustering and fuzzy clustering: A comparative experimental study. Fuzzy Sets and Systems, 161:522–543. DOI: 10.1016/j.fss.2009.10.021.
- 30
Halkidi M., Batistakis Y., Vazirgiannis M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2–3):107–145. DOI: 10.1023/A:1012801612483.
- 31
Hubert L., Arabie P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218. DOI: 10.1007/BF01908075.
- 32
Jain A., Law M. (2005). Data clustering: A user's dilemma. Lecture Notes in Computer Science, 3776:1–10.
- 33
Jamil M., Yang X.-S., Zepernick H.-J. (2013). Test functions for global optimization: A comprehensive survey. In: Swarm Intelligence and Bio-Inspired Computation, pp. 193–222. DOI: 10.1016/B978-0-12-405163-8.00008-9.
- 34
Karypis G., Han E.H., Kumar V. (1999). CHAMELEON: Hierarchical clustering using dynamic modeling. Computer, 32(8):68–75. DOI: 10.1109/2.781637.
- 35
Kvalseth T.O. (1987). Entropy and correlation: Some comments. IEEE Trans. Systems, Man and Cybernetics, 17(3):517–519. DOI: 10.1109/TSMC.1987.4309069.
- 36
Kärkkäinen I., Fränti P. (2002). Dynamic local search algorithm for the clustering problem. In: Proc. 16th Intl. Conf. Pattern Recognition'02, volume 2, pp. 240–243. IEEE.
- 37
Lloyd S.P. (1957). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137. Originally a 1957 Bell Telephone Laboratories Research Report; republished in 1982. DOI: 10.1109/TIT.1982.1056489.
- 38
Maulik U., Bandyopadhyay S. (2002). Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12):1650–1654. DOI: 10.1109/TPAMI.2002.1114856.
- 39
McInnes L., Healy J., Astels S. (2017). hdbscan: Hierarchical density based clustering. The Journal of Open Source Software, 2(11):205. DOI: 10.21105/joss.00205.
- 40
Meilă M., Heckerman D. (2001). An experimental comparison of model-based clustering methods. Machine Learning, 42:9–29. DOI: 10.1023/A:1007648401407.
- 41
Milligan G.W., Cooper M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179.
- 42
Müller A.C., Nowozin S., Lampert C.H. (2012). Information theoretic clustering using minimum spanning trees. In: Proc. German Conference on Pattern Recognition. URL: https://github.com/amueller/information-theoretic-mst.
- 43
Müllner D. (2013). fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, 53(9):1–18. DOI: 10.18637/jss.v053.i09.
- 44
Pedregosa F., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(85):2825–2830. URL: http://jmlr.org/papers/v12/pedregosa11a.html.
- 45
Rand W.M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850. DOI: 10.2307/2284239.
- 46
Rezaei M., Fränti P. (2016). Set matching measures for external cluster validity. IEEE Transactions on Knowledge and Data Engineering, 28(8):2173–2186. DOI: 10.1109/TKDE.2016.2551240.
- 47
Rousseeuw P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65. DOI: 10.1016/0377-0427(87)90125-7.
- 48
Sieranoja S., Fränti P. (2019). Fast and general density peaks clustering. Pattern Recognition Letters, 128:551–558. DOI: 10.1016/j.patrec.2019.10.019.
- 49
Steinley D. (2004). Properties of the Hubert–Arabie adjusted Rand index. Psychological Methods, 9(3):386–396. DOI: 10.1037/1082-989X.9.3.386.
- 50
Thrun M.C., Ultsch A. (2020). Clustering benchmark datasets exploiting the fundamental clustering problems. Data in Brief, 30:105501. DOI: 10.1016/j.dib.2020.105501.
- 51
Ullmann T., Beer A., Hünemörder M., Seidl T., Boulesteix A.-L. (2022). Over-optimistic evaluation and reporting of novel cluster algorithms: An illustrative study. Advances in Data Analysis and Classification. DOI: 10.1007/s11634-022-00496-5.
- 52
Ullmann T., Hennig C., Boulesteix A.-L. (2021). Validation of cluster analysis results on validation data: A systematic framework. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12(3):e1444. DOI: 10.1002/widm.1444.
- 53
Ultsch A. (2005). Clustering with SOM: U*C. In: Workshop on Self-Organizing Maps, pp. 75–82.
- 54
Veenman C.J., Reinders M.J.T., Backer E. (2002). A maximum variance cluster algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1273–1280.
- 55
Vinh N.X., Epps J., Bailey J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(95):2837–2854. URL: http://jmlr.org/papers/v11/vinh10a.html.
- 56
von Luxburg U., Williamson R.C., Guyon I. (2012). Clustering: science or art? In: Guyon I., et al., editors, Proc. ICML Workshop on Unsupervised and Transfer Learning, volume 27 of Proc. Machine Learning Research, pp. 65–79.
- 57
Ward Jr. J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244. DOI: 10.1080/01621459.1963.10500845.
- 58
Weise T., et al. (2014). Benchmarking optimization algorithms: An open source framework for the traveling salesman problem. IEEE Computational Intelligence Magazine, 9(3):40–52. DOI: 10.1109/MCI.2014.2326101.
- 59
Xiao H., Rasul K., Vollgraf R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. URL: https://arxiv.org/pdf/1708.07747.pdf, DOI: 10.48550/arXiv.1708.07747.
- 60
Xu Q., Zhang Q., Liu J., Luo B. (2020). Efficient synthetical clustering validity indexes for hierarchical clustering. Expert Systems with Applications, 151:113367. DOI: 10.1016/j.eswa.2020.113367.
- 61
Zahn C.T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20(1):68–86.
- 62
Zhang T., Ramakrishnan R., Livny M. (1997). BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1:141–182. DOI: 10.1023/A:1009783824328.
- 63
Van Mechelen I., et al. (2018). Benchmarking in cluster analysis: A white paper. URL: https://arxiv.org/pdf/1809.10496.pdf.