OPTIMAL ALGORITHM FOR FINDING DNA MOTIFS WITH NUCLEOTIDE ADJACENT DEPENDENCY
The research was supported in parts by the RGC grant HKU 7120/06E.
Finding motifs and the corresponding binding sites is a critical and challenging problem in studying the process of gene expression. String and matrix representations are two popular models to represent a motif. However, both representations share an important weakness by assuming that the occurrence of a nucleotide in a binding site is independent of other nucleotides. More complicated representations, such as HMM or regular expression, exist that can capture the nucleotide dependency. Unfortunately, these models are not practical (with too many parameters and require many known binding sites). Recently, Chin and Leung introduced the SPSP representation which overcomes the limitations of these complicated models. However, discovering novel motifs in SPSP representation is still a NP-hard problem. In this paper, based on our observations in real binding sites, we propose a simpler model, the Dependency Pattern Sets (DPS) representation, which is simpler than the SPSP model but can still capture the nucleotide dependency. We develop a branch and bound algorithm (DPS-Finder) for finding optimal DPS motifs. Experimental results show that DPS-Finder can discover a length-10 motif from 22 length-500 DNA sequences within a few minutes and the DPS representation has a similar performance as SPSP representation.