PREDICTION OF GENE LOCATIONS USING DNA MARKOV CHAIN MODELS
Protein-coding and non-coding regions of the DNA primary structure can be represented by non-homogeneous and homogeneous Markov chain models, respectively. These models can be employed by an algorithm that predicts gene locations in newly sequenced DNA. The key notion of this algorithm is the a posteriori probability that a given fragment of DNA sequence has protein-coding function. We use Markov chain models of the first through the fifth order to calculate this probability. The parameters of the non-homogeneous and homogeneous Markov chain models have been derived from a training set of 479,589 bp of coding and 245,307 bp of non-coding prokaryotic (E. coli) DNA sequences. The predictive accuracy of the method has been determined on a control set of 373,845 bp of coding and 131,538 bp of non-coding E. coli DNA sequences. For instance, the version of the algorithm that employs fourth-order Markov chain models gives a 10.0% false negative rate and a 25.2% false positive rate when coding function is identified for a fragment 96 bp long.
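The idea of scoring a fragment by its a posteriori probability of coding function can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give the formula. The sketch assumes only two hypotheses (coding in a known reading frame versus non-coding), a three-periodic chain as the non-homogeneous model, add-one smoothing, a prior of 0.5, and likelihoods conditioned on the first `order` bases. The function names `train_counts`, `log_likelihood` and `coding_posterior` are hypothetical.

```python
import math

ALPHABET = "ACGT"

def train_counts(seqs, order, periodic):
    """Count (context -> next base) transitions.

    periodic=True keeps a separate table per codon position (an assumed
    three-periodic, non-homogeneous chain); periodic=False keeps a single
    table (homogeneous chain). Sequences are assumed to start in frame.
    """
    n_phase = 3 if periodic else 1
    counts = [{} for _ in range(n_phase)]
    for s in seqs:
        for i in range(order, len(s)):
            table = counts[i % n_phase].setdefault(s[i - order:i], {})
            table[s[i]] = table.get(s[i], 0) + 1
    return counts

def log_likelihood(frag, counts, order):
    """Log-probability of frag given its first `order` bases."""
    n_phase = len(counts)
    ll = 0.0
    for i in range(order, len(frag)):
        table = counts[i % n_phase].get(frag[i - order:i], {})
        total = sum(table.values())
        # Add-one smoothing over the four bases (an assumption of this sketch).
        ll += math.log((table.get(frag[i], 0) + 1) / (total + len(ALPHABET)))
    return ll

def coding_posterior(frag, coding, noncoding, order, prior_coding=0.5):
    """Bayes' rule over the two hypotheses, computed in log space."""
    lc = log_likelihood(frag, coding, order) + math.log(prior_coding)
    ln = log_likelihood(frag, noncoding, order) + math.log(1.0 - prior_coding)
    m = max(lc, ln)  # subtract the maximum to avoid underflow
    ec, en = math.exp(lc - m), math.exp(ln - m)
    return ec / (ec + en)

# Toy usage with made-up training strings (not E. coli data).
coding_model = train_counts(["ATGGCA" * 40], order=1, periodic=True)
noncoding_model = train_counts(["AATT" * 60], order=1, periodic=False)
p = coding_posterior("ATGGCAATGGCA", coding_model, noncoding_model, order=1)
```

In this toy setting `p` exceeds 0.5 because the fragment matches the periodic training string. A fragment would be called coding when the posterior passes a chosen threshold, and moving that threshold trades false negatives against false positives.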