Chapter 8: Optimised Peptide Pattern Discovery
When analysing protease cleavage sites or post-translational modification sites from peptide data, a key question is whether interpretable, explainable rules can be discovered that make clear how peptides are classified. A linear model offers a readily interpretable relationship between experimental data, such as peptides, and their labels. However, the relationship between peptide labels and the peptides used in protease cleavage or post-translational modification pattern discovery is not always simple. Moreover, peptides are non-numerical data. Most nonlinear models, such as neural networks, on the other hand, offer little insight into the data. Decision-tree and random forest algorithms yield models that are easier to interpret, but discovering the optimal model requires an expensive exhaustive enumeration. For this reason, evolutionary computation approaches have been widely employed in many areas to generate optimal or near-optimal models that remain interpretable.

This chapter introduces a different type of machine learning approach for this kind of biological pattern discovery: the genetic programming algorithm, which is one type of evolutionary computation approach. It shows how genetic programming can be used to discover interpretable rules for a peptide data set, and how the rules it develops can explain the interplay between residues within peptides.
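
To make the idea of an interpretable rule concrete, the following minimal sketch shows the kind of expression a genetic programming search might evolve: a small tree of logical operators over residue tests. Everything in it is an assumption made for illustration: the helper names (`at`, `both`, `negate`), the rule itself and the three toy peptides are hypothetical and are not taken from any data set or algorithm described in this chapter.

```python
from typing import Callable

# A rule maps a peptide (a string of one-letter residue codes) to a label.
Rule = Callable[[str], bool]


def at(pos: int, residues: str) -> Rule:
    """Leaf node: true when the residue at index `pos` is one of `residues`."""
    return lambda peptide: peptide[pos] in residues


def both(a: Rule, b: Rule) -> Rule:
    """Internal node: logical AND of two sub-rules."""
    return lambda peptide: a(peptide) and b(peptide)


def negate(a: Rule) -> Rule:
    """Internal node: logical NOT of a sub-rule."""
    return lambda peptide: not a(peptide)


# Hypothetical rule, readable directly from its structure:
# (index 3 is K or R) AND NOT (index 4 is P).
rule = both(at(3, "KR"), negate(at(4, "P")))

# Toy peptides invented for this sketch; they are not experimental data.
peptides = ["AGVKSLTE", "AGVKPLTE", "AGVASLTE"]
predictions = [rule(p) for p in peptides]

for peptide, label in zip(peptides, predictions):
    print(peptide, label)
```

Because the rule is an explicit combination of residue tests, the interplay between positions (here, indices 3 and 4) can be read directly from it, which is the interpretability property this chapter is concerned with.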