Variational Autoencoder-Based Dimensionality Reduction for High-Dimensional Small-Sample Data Classification

National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, P. R. China

Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, P. R. China

E-mail Address: sultan@szu.edu.cn

Corresponding author.

Joshua Zhexue Huang

National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, P. R. China

Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, P. R. China

E-mail Address: zx.huang@szu.edu.cn

, and

Xianghua Fu

National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, P. R. China

Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, P. R. China

E-mail Address: fuxh@szu.edu.cn

https://doi.org/10.1142/S1469026820500029

Abstract

Classification problems in which the number of features (dimensions) far exceeds the number of samples (observations) are an essential research and application area in a variety of domains, especially computational biology. This setting is known as the high-dimensional small-sample-size (HDSSS) problem. Various dimensionality reduction methods have been developed, but they are not effective on small-sample, high-dimensional datasets and suffer from overfitting and high-variance gradients. To overcome the pitfalls of sample size and dimensionality, this study employs the variational autoencoder (VAE), which has emerged in recent years as a dynamic framework for unsupervised learning. The objective of this study is to identify a reliable classification model for high-dimensional, small-sample datasets with minimal error. It also evaluates the strength of different VAE architectures on HDSSS datasets. In the experiments, six genomic microarray datasets from the Kent Ridge Biomedical Dataset Repository were selected, and several choices of dimensions (features) were applied for data preprocessing. To evaluate classification accuracy and to find a stable and suitable classifier, nine state-of-the-art classifiers that have been successful in high-dimensional classification tasks were selected. The experimental results demonstrate that the VAE can provide superior performance compared to traditional methods such as PCA, fastICA, FA, NMF, and LDA in terms of accuracy and AUROC.
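The abstract does not give the architecture or the loss used, so the following is only a minimal sketch of two ingredients of the standard VAE formulation: the reparameterized latent sample and the closed-form KL term against a standard normal prior. The function names, the three-dimensional latent size, and the choice of the encoder mean as the reduced feature vector are illustrative assumptions, not details taken from the study.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps, with eps ~ N(0, 1) per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


# Hypothetical encoder output for one sample, reduced to 3 latent dimensions.
mu = [1.0, 0.0, -1.0]
log_var = [0.0, 0.0, 0.0]

rng = random.Random(0)                   # fixed seed so the draw is repeatable
z = reparameterize(mu, log_var, rng)     # stochastic code, as used during training
features = mu                            # assumption: the mean serves as the reduced feature vector
kl = kl_to_standard_normal(mu, log_var)  # regularizer added to the reconstruction loss
print(len(z), kl)
```

In a full model, `mu` and `log_var` would come from a trained encoder network, and the reduced features would then be passed to a downstream classifier.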
