Special Issue: Selected Papers from the 2020 International Conference on Bioinformatics and Computational Biology (BICOB-2020)

Guest Editors: Oliver Eulenstein, Qin Ding and Hisham Al-Mubaid

Unsupervised multi-instance learning for protein structure determination

Fardina Fathmiul Alam

Department of Computer Science, George Mason University, Fairfax, Virginia 22030, USA

E-mail Address: falam5@gmu.edu


and

Amarda Shehu

Department of Computer Science, George Mason University, Fairfax, Virginia 22030, USA

E-mail Address: amarda@gmu.edu


https://doi.org/10.1142/S0219720021400023

Abstract

Many regions of the protein universe remain inaccessible to wet-laboratory or computational structure determination methods. A significant challenge in elucidating these dark regions in silico is discriminating the relevant structure(s) among the many structures, or decoys, computed for a protein of interest, a problem known as decoy selection. Clustering decoys based on geometric similarity remains popular. However, it is unclear how exactly to exploit the groups of decoys revealed via clustering to select individual structures for prediction. In this paper, we provide an intuitive formulation of the decoy selection problem as an instance of unsupervised multi-instance learning. We address the problem in three stages: first organizing the given decoys of a protein molecule into bags, then identifying relevant bags, and finally drawing individual instances from these bags to offer as predictions. We propose both non-parametric and parametric algorithms for drawing individual instances. Our evaluation uses two datasets: a benchmark dataset of decoy ensembles for a varied list of protein molecules, and a dataset of decoy ensembles for targets drawn from recent CASP competitions. A comparative analysis with state-of-the-art methods shows that the proposed approach outperforms existing methods, warranting further investigation of multi-instance learning to advance our treatment of decoy selection.
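The three stages named in the abstract (bags, relevant bags, instance draws) can be illustrated with a minimal Python sketch. The abstract does not specify how each stage is carried out, so every concrete choice below is an assumption made for illustration: decoys are represented as short numeric feature vectors, bags are formed with plain k-means, the relevant bags are taken to be the most populated ones, the non-parametric draw picks the member closest to the bag centroid, and the parametric draw picks the member with the highest density under a diagonal Gaussian fitted to the bag. All function names are hypothetical and are not taken from the paper.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def make_bags(decoys, k, iters=20, seed=0):
    """Stage 1 (assumed: plain k-means). Returns bags as lists of decoy indices."""
    centers = random.Random(seed).sample(decoys, k)
    for _ in range(iters):
        bags = [[] for _ in range(k)]
        for i, d in enumerate(decoys):
            nearest = min(range(k), key=lambda c: sq_dist(d, centers[c]))
            bags[nearest].append(i)
        # Recompute each center; an empty bag keeps its previous center.
        centers = [centroid([decoys[i] for i in b]) if b else centers[j]
                   for j, b in enumerate(bags)]
    return [b for b in bags if b]


def relevant_bags(bags, n_bags):
    """Stage 2 (assumed criterion): keep the n_bags most populated bags."""
    return sorted(bags, key=len, reverse=True)[:n_bags]


def draw_nonparametric(decoys, bag):
    """Stage 3a (assumed): the bag member closest to the bag centroid."""
    c = centroid([decoys[i] for i in bag])
    return min(bag, key=lambda i: sq_dist(decoys[i], c))


def draw_parametric(decoys, bag, var_floor=1e-9):
    """Stage 3b (assumed): the member with the highest log-density under a
    diagonal Gaussian fitted to the bag."""
    pts = [decoys[i] for i in bag]
    mu = centroid(pts)
    var = [max(sum((p[d] - mu[d]) ** 2 for p in pts) / len(pts), var_floor)
           for d in range(len(mu))]

    def log_density(i):
        return -0.5 * sum(math.log(2 * math.pi * v) + (x - m) ** 2 / v
                          for x, m, v in zip(decoys[i], mu, var))

    return max(bag, key=log_density)


def select_decoys(decoys, k, n_bags=1, parametric=False):
    """Run the three stages and return one selected decoy index per relevant bag."""
    draw = draw_parametric if parametric else draw_nonparametric
    bags = relevant_bags(make_bags(decoys, k), n_bags)
    return [draw(decoys, bag) for bag in bags]


# Toy 2-D "decoys": a dense group near the origin and a small distant group.
toy_decoys = [(0.0, 0.0), (0.1, 0.0), (-0.1, 0.0), (0.0, 0.1),
              (0.0, -0.1), (0.2, 0.2), (10.0, 10.0), (10.1, 10.0)]
print(select_decoys(toy_decoys, k=2))
print(select_decoys(toy_decoys, k=2, parametric=True))
```

Keeping the three stages as separate functions means any one of the assumed choices (the clustering method, the bag-ranking criterion, or the draw rule) can be replaced without touching the others, which mirrors the staged formulation described in the abstract.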

