UNSUPERVISED CLASSIFICATION OF TREE STRUCTURED OBJECTS
Recent developments in medical image analysis, phylogenetics and proteomics motivate the statistical analysis of populations of tree-structured data objects. In this context, unsupervised classification of trees arises as a challenging new area that depends on the careful development of novel mathematical frameworks. The discussion will center on statistical aspects of clustering in a setting where the tree data to be clustered have been sampled from some unknown probability distribution. Following Ref. 12, we will try to verify two conditions: appropriateness, meaning that the clustering of the data set should reveal some structure of the underlying data rather than model artifacts due to the random sampling process; and steadiness, meaning that the more sample points we have, the more reliable the clustering should be. We will address steadiness and reliability by extending the convergence properties of a class of non-parametric clustering algorithms, namely k-means defined on different metric spaces of trees. We will explore the appropriateness of the clustering outputs of k-means on a real data set from proteomics, and we will comment on the results from Ref. 1 on three real data sets of phylogenetic trees.