
Databionics Research Group, Philipps-University of Marburg, D-35032 Marburg, Germany

Department of Hematology, Oncology and Immunology, Philipps-University Marburg, Germany

E-mail Address: mthrun@mathematik.uni-marburg.de

https://doi.org/10.1142/S1469026821500164

Abstract

Although distance measures are used in many machine learning algorithms, the literature on the context-independent selection and evaluation of distance measures is limited, in the sense that existing approaches rely on prior knowledge. In cluster analysis, current studies evaluate the choice of distance measure after applying unsupervised methods, based on error probabilities, which implicitly sets the goal of reproducing predefined partitions of the data. Such studies use clusters that often depend on the context of the data as well as on the custom goal of the specific study. Depending on the data context, different properties of distance distributions are judged to be relevant for appropriate distance selection. However, if cluster analysis is based on the task of finding partitions of similar data, then the intrapartition distances should be smaller than the interpartition distances. By systematically investigating this specification using distribution analysis with the mirrored-density plot (MD plot), it is shown that multimodal distance distributions are preferable in cluster analysis. As a consequence, it is advantageous to model distance distributions with Gaussian mixtures prior to the evaluation phase of unsupervised methods. Experiments are performed on several artificial and natural datasets for the task of clustering.
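The central specification of the abstract, that intrapartition distances should be smaller than interpartition distances and that the resulting distance distribution is multimodal, can be illustrated with a small sketch. The following is not the procedure used in the article: the two synthetic clusters, the helper names `make_cluster` and `fit_two_gaussians`, and the plain two-component EM fit (standing in for the MD plot and for any dedicated mixture-modeling tool) are all assumptions made for illustration only.

```python
import math
import random
from itertools import combinations

random.seed(0)  # fixed seed so the illustration is repeatable


def make_cluster(cx, cy, n, label):
    """Hypothetical helper: n labeled 2-D points around (cx, cy), unit spread."""
    return [((random.gauss(cx, 1.0), random.gauss(cy, 1.0)), label)
            for _ in range(n)]


# Two well-separated artificial partitions (an assumption of this sketch).
points = make_cluster(0.0, 0.0, 40, 0) + make_cluster(10.0, 0.0, 40, 1)

# Split all pairwise Euclidean distances into intra- and interpartition sets.
intra, inter = [], []
for (p, lp), (q, lq) in combinations(points, 2):
    d = math.hypot(p[0] - q[0], p[1] - q[1])
    (intra if lp == lq else inter).append(d)

mean_intra = sum(intra) / len(intra)
mean_inter = sum(inter) / len(inter)


def fit_two_gaussians(xs, n_iter=100):
    """Hypothetical helper: plain EM for a two-component 1-D Gaussian mixture."""
    m = sum(xs) / len(xs)
    sd0 = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    mu, sigma, weight = [min(xs), max(xs)], [sd0, sd0], [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each distance.
        resp = []
        for x in xs:
            dens = [weight[k]
                    * math.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
                    / (sigma[k] * math.sqrt(2.0 * math.pi))
                    for k in range(2)]
            total = sum(dens) or 1e-300  # guard against underflow
            resp.append([v / total for v in dens])
        # M-step: re-estimate mean, spread and weight of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = max(math.sqrt(var), 1e-6)  # floor avoids collapse
            weight[k] = nk / len(xs)
    return mu, sigma, weight


# Fit the mixture to the pooled distances, without using the labels.
mu, sigma, weight = fit_two_gaussians(intra + inter)

print("mean intrapartition distance:", round(mean_intra, 2))
print("mean interpartition distance:", round(mean_inter, 2))
print("mixture component means:", [round(v, 2) for v in mu])
```

In this toy setting the labels are used only to check the specification. The mixture itself is fitted to the pooled distances alone, which mirrors the idea of inspecting the distance distribution before any clustering method is evaluated.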

