References

How to Cite

When using this framework in research publications, please cite [21]; the full bibliographic details are given below. Thank you.

Additionally, please acknowledge the benchmark suite project [27] as well as the literature references listed in the description files that accompany each dataset studied.
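For convenience, the bibliographic data of entry [21] can be expressed as a BibTeX record along the following lines (the entry key is an arbitrary placeholder; please verify the details against the publisher's record):

```bibtex
@article{gagolewski2022benchmarking,
  author  = {Gagolewski, M.},
  title   = {A framework for benchmarking clustering algorithms},
  journal = {SoftwareX},
  year    = {2022},
  volume  = {20},
  pages   = {101270},
  doi     = {10.1016/j.softx.2022.101270},
  url     = {https://clustering-benchmarks.gagolewski.com}
}
```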

Bibliography

[1] Arbelaitz O., Gurrutxaga I., Muguerza J., Pérez J.M., Perona I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1):243–256. DOI: 10.1016/j.patcog.2012.07.021.

[2] Ball G.H., Hall D.J. (1965). ISODATA: A novel method of data analysis and pattern classification. Technical Report AD699616, Stanford Research Institute.

[3] Bezdek J.C., Keller J.M., Krishnapuram R., Kuncheva L.I., Pal N.R. (1999). Will the real iris data please stand up? IEEE Transactions on Fuzzy Systems, 7(3):368–369. DOI: 10.1109/91.771092.

[4] Bezdek J.C., Pal N.R. (1998). Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 28(3):301–315. DOI: 10.1109/3477.678624.

[5] Blum A., Hopcroft J., Kannan R. (2020). Foundations of Data Science. Cambridge University Press. URL: https://www.cs.cornell.edu/jeh/book.pdf.

[6] Buitinck L., et al. (2013). API design for machine learning software: Experiences from the scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122.

[7] Caliński T., Harabasz J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3(1):1–27. DOI: 10.1080/03610927408827101.

[8] Chang H., Yeung D.Y. (2008). Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203.

[9] Crouse D.F. (2016). On implementing 2D rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696. DOI: 10.1109/TAES.2016.140952.

[10] Dasgupta S., Ng V. (2009). Single data, multiple clusterings. In: Proc. NIPS Workshop Clustering: Science or Art? Towards Principled Approaches. URL: https://clusteringtheory.org.

[11] Davies D.L., Bouldin D.W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):224–227. DOI: 10.1109/TPAMI.1979.4766909.

[12] Dua D., Graff C. (2022). UCI Machine Learning Repository. URL: http://archive.ics.uci.edu/ml.

[13] Dunn J.C. (1974). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32–57. DOI: 10.1080/01969727308546046.

[14] Edwards A.W.F., Cavalli-Sforza L.L. (1965). A method for cluster analysis. Biometrics, 21(2):362–375. DOI: 10.2307/2528096.

[15] Fisher R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188.

[16] Fowlkes E.B., Mallows C.L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569.

[17] Fränti P., Sieranoja S. (2018). K-means properties on six clustering benchmark datasets. Applied Intelligence, 48(12):4743–4759. DOI: 10.1007/s10489-018-1238-7.

[18] Fränti P., Virmajoki O. (2006). Iterative shrinking method for clustering problems. Pattern Recognition, 39(5):761–765.

[19] Fu L., Medico E. (2007). FLAME: A novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinformatics, 8:3.

[20] Gagolewski M. (2021). genieclust: Fast and robust hierarchical clustering. SoftwareX, 15:100722. DOI: 10.1016/j.softx.2021.100722.

[21] Gagolewski M. (2022). A framework for benchmarking clustering algorithms. SoftwareX, 20:101270. URL: https://clustering-benchmarks.gagolewski.com, DOI: 10.1016/j.softx.2022.101270.

[22] Gagolewski M. (2022). Adjusted asymmetric accuracy: A well-behaving external cluster validity measure. Under review (preprint). URL: https://arxiv.org/pdf/2209.02935.pdf, DOI: 10.48550/arXiv.2209.02935.

[23] Gagolewski M. (2022). Minimalist Data Wrangling with Python. Zenodo, Melbourne. ISBN 978-0-6455719-1-2. URL: https://datawranglingpy.gagolewski.com/, DOI: 10.5281/zenodo.6451068.

[24] Gagolewski M. (2023). Deep R Programming. Early draft. Zenodo, Melbourne. ISBN 978-0-6455719-2-9 (reserved). URL: https://deepr.gagolewski.com/, DOI: 10.5281/zenodo.7490464.

[25] Gagolewski M., Bartoszuk M., Cena A. (2016). Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm. Information Sciences, 363:8–23. DOI: 10.1016/j.ins.2016.05.003.

[26] Gagolewski M., Bartoszuk M., Cena A. (2021). Are cluster validity measures (in)valid? Information Sciences, 581:620–636. DOI: 10.1016/j.ins.2021.10.004.

[27] Gagolewski M., et al. (2022). A benchmark suite for clustering algorithms: Version 1.1.0. URL: https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0, DOI: 10.5281/zenodo.7088171.

[28] Gionis A., Mannila H., Tsaparas P. (2007). Clustering aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1):4.

[29] Graves D., Pedrycz W. (2010). Kernel-based fuzzy clustering and fuzzy clustering: A comparative experimental study. Fuzzy Sets and Systems, 161:522–543. DOI: 10.1016/j.fss.2009.10.021.

[30] Halkidi M., Batistakis Y., Vazirgiannis M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2–3):107–145. DOI: 10.1023/A:1012801612483.

[31] Hubert L., Arabie P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218. DOI: 10.1007/BF01908075.

[32] Jain A., Law M. (2005). Data clustering: A user's dilemma. Lecture Notes in Computer Science, 3776:1–10.

[33] Jamil M., Yang X.-S., Zepernick H.-J. (2013). Test functions for global optimization: A comprehensive survey. In: Swarm Intelligence and Bio-Inspired Computation, pp. 193–222. DOI: 10.1016/B978-0-12-405163-8.00008-9.

[34] Karypis G., Han E.H., Kumar V. (1999). CHAMELEON: Hierarchical clustering using dynamic modeling. Computer, 32(8):68–75. DOI: 10.1109/2.781637.

[35] Kvalseth T.O. (1987). Entropy and correlation: Some comments. IEEE Transactions on Systems, Man, and Cybernetics, 17(3):517–519. DOI: 10.1109/TSMC.1987.4309069.

[36] Kärkkäinen I., Fränti P. (2002). Dynamic local search algorithm for the clustering problem. In: Proc. 16th Intl. Conf. Pattern Recognition (ICPR'02), volume 2, pp. 240–243. IEEE.

[37] Lloyd S.P. (1957). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:128–137. Originally a 1957 Bell Telephone Laboratories Research Report; republished in 1982. DOI: 10.1109/TIT.1982.1056489.

[38] Maulik U., Bandyopadhyay S. (2002). Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12):1650–1654. DOI: 10.1109/TPAMI.2002.1114856.

[39] McInnes L., Healy J., Astels S. (2017). hdbscan: Hierarchical density based clustering. The Journal of Open Source Software, 2(11):205. DOI: 10.21105/joss.00205.

[40] Meilă M., Heckerman D. (2001). An experimental comparison of model-based clustering methods. Machine Learning, 42:9–29. DOI: 10.1023/A:1007648401407.

[41] Milligan G.W., Cooper M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179.

[42] Müller A.C., Nowozin S., Lampert C.H. (2012). Information theoretic clustering using minimum spanning trees. In: Proc. German Conference on Pattern Recognition. URL: https://github.com/amueller/information-theoretic-mst.

[43] Müllner D. (2013). fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, 53(9):1–18. DOI: 10.18637/jss.v053.i09.

[44] Pedregosa F., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(85):2825–2830. URL: http://jmlr.org/papers/v12/pedregosa11a.html.

[45] Rand W.M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850. DOI: 10.2307/2284239.

[46] Rezaei M., Fränti P. (2016). Set matching measures for external cluster validity. IEEE Transactions on Knowledge and Data Engineering, 28(8):2173–2186. DOI: 10.1109/TKDE.2016.2551240.

[47] Rousseeuw P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65. DOI: 10.1016/0377-0427(87)90125-7.

[48] Sieranoja S., Fränti P. (2019). Fast and general density peaks clustering. Pattern Recognition Letters, 128:551–558. DOI: 10.1016/j.patrec.2019.10.019.

[49] Steinley D. (2004). Properties of the Hubert–Arabie adjusted Rand index. Psychological Methods, 9(3):386–396. DOI: 10.1037/1082-989X.9.3.386.

[50] Thrun M.C., Ultsch A. (2020). Clustering benchmark datasets exploiting the fundamental clustering problems. Data in Brief, 30:105501. DOI: 10.1016/j.dib.2020.105501.

[51] Ullmann T., Beer A., Hünemörder M., Seidl T., Boulesteix A.-L. (2022). Over-optimistic evaluation and reporting of novel cluster algorithms: An illustrative study. Advances in Data Analysis and Classification. DOI: 10.1007/s11634-022-00496-5.

[52] Ullmann T., Hennig C., Boulesteix A.-L. (2021). Validation of cluster analysis results on validation data: A systematic framework. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12(3):e1444. DOI: 10.1002/widm.1444.

[53] Ultsch A. (2005). Clustering with SOM: U*C. In: Workshop on Self-Organizing Maps, pp. 75–82.

[54] Veenman C.J., Reinders M.J.T., Backer E. (2002). A maximum variance cluster algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1273–1280.

[55] Vinh N.X., Epps J., Bailey J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(95):2837–2854. URL: http://jmlr.org/papers/v11/vinh10a.html.

[56] von Luxburg U., Williamson R.C., Guyon I. (2012). Clustering: Science or art? In: Guyon I., et al., editors, Proc. ICML Workshop on Unsupervised and Transfer Learning, volume 27 of Proc. Machine Learning Research, pp. 65–79.

[57] Ward Jr. J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244. DOI: 10.1080/01621459.1963.10500845.

[58] Weise T., et al. (2014). Benchmarking optimization algorithms: An open source framework for the traveling salesman problem. IEEE Computational Intelligence Magazine, 9(3):40–52. DOI: 10.1109/MCI.2014.2326101.

[59] Xiao H., Rasul K., Vollgraf R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. URL: https://arxiv.org/pdf/1708.07747.pdf, DOI: 10.48550/arXiv.1708.07747.

[60] Xu Q., Zhang Q., Liu J., Luo B. (2020). Efficient synthetical clustering validity indexes for hierarchical clustering. Expert Systems with Applications, 151:113367. DOI: 10.1016/j.eswa.2020.113367.

[61] Zahn C.T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20(1):68–86.

[62] Zhang T., Ramakrishnan R., Livny M. (1997). BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1:141–182. DOI: 10.1023/A:1009783824328.

[63] Van Mechelen I., et al. (2018). Benchmarking in cluster analysis: A white paper. URL: https://arxiv.org/pdf/1809.10496.pdf.