Research Article

A model-based clustering algorithm with covariates adjustment and its application to lung cancer stratification

Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão 1010 São Paulo, São Paulo 05508-090, Brazil

E-mail Address: carlos.edu.relvas@gmail.com

Asuka Nakata

Cancer Research Institute, Kanazawa University, Kanazawa, Ishikawa 920-1164, Japan

E-mail Address: a.nakata.bs@gmail.com

Guoan Chen

School of Medicine, Southern University of Science and Technology, 1088 Xueyuan Blvd. Shenzhen, Guangdong 518055, P. R. China

E-mail Address: cheng@sustech.edu.cn

David G. Beer

Rogel Cancer Center, University of Michigan, 1500 E Medical Center Dr Ann Arbor, Michigan 48109, USA

E-mail Address: dgbeer@med.umich.edu

Noriko Gotoh

Cancer Research Institute, Kanazawa University, Kanazawa, Ishikawa 920-1164, Japan

E-mail Address: ngotoh@staff.kanazawa-u.ac.jp

Andre Fujita

https://orcid.org/0000-0002-7756-7051

Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão 1010 São Paulo, São Paulo 05508-090, Brazil

E-mail Address: andrefujita@usp.br

Corresponding author.

https://doi.org/10.1142/S0219720023500191

Abstract

Clustering is often the first step in a data analysis. It can reveal patterns that had previously gone unnoticed and help raise new hypotheses. One challenge when analyzing empirical data, however, is the presence of covariates, which may mask the clustering structure of interest. For example, suppose we want to cluster a set of individuals into controls and cancer patients. A clustering algorithm could instead group the subjects into young and elderly, because age at diagnosis is associated with cancer. We therefore developed CEM-Co, a model-based clustering algorithm that removes or minimizes the effects of undesirable covariates during the clustering process. We applied CEM-Co to a gene expression dataset of 129 stage I non-small cell lung cancer patients and identified a subgroup with a poorer prognosis, which standard clustering algorithms failed to detect.
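The masking effect described above can be illustrated with a small simulation. The sketch below is not CEM-Co. It is a naive two-step stand-in (regress the covariate out by least squares, then run plain 2-means) on synthetic data, and every name, effect size, and sample size in it is an assumption made for illustration. It also assumes the covariate is independent of the latent subgroup. When that does not hold, residualizing before clustering can remove part of the subgroup signal, which is one reason to handle covariates during the clustering process instead.

```python
import numpy as np


def simulate(n=200, seed=0):
    """Synthetic data (hypothetical): two features driven by age and a latent subgroup."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)            # latent subgroup we would like to recover
    age = rng.uniform(30, 80, size=n)         # covariate, independent of z in this toy setup
    noise = rng.normal(0.0, 0.5, size=(n, 2))
    # Age shifts both features in the same direction; the subgroup shifts them in opposite ones.
    x = 0.2 * age[:, None] + 2.0 * z[:, None] * np.array([1.0, -1.0]) + noise
    return x, age, z


def two_means(x, n_iter=50):
    """Plain Lloyd's algorithm with k=2 on the rows of x; returns 0/1 labels."""
    x = np.asarray(x, dtype=float)
    # Deterministic start: the points with the smallest and largest first coordinate.
    centers = np.stack([x[np.argmin(x[:, 0])], x[np.argmax(x[:, 0])]])
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        centers = np.stack([x[labels == k].mean(axis=0) for k in (0, 1)])
    return labels


def residualize(x, covariate):
    """Remove the linear effect of one covariate from each column of x via least squares."""
    design = np.column_stack([np.ones_like(covariate), covariate])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - design @ coef


def agreement(labels, truth):
    """Fraction of matching labels, taking the better of the two label permutations."""
    match = np.mean(labels == truth)
    return max(match, 1.0 - match)


if __name__ == "__main__":
    x, age, z = simulate()
    print("agreement with subgroup, raw data:     ", agreement(two_means(x), z))
    print("agreement with subgroup, age regressed:", agreement(two_means(residualize(x, age)), z))
```

In this toy setup the raw features vary far more with age than with subgroup, so 2-means on the raw data tends to split subjects by age, while clustering the residuals tends to recover the subgroup.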
