In this paper, we develop a machine learning system for determining gene functions from heterogeneous data sources using a Weighted Naive Bayesian network (WNB). Knowledge of gene functions is crucial for understanding many fundamental biological mechanisms such as regulatory pathways, cell cycles and diseases. Our major goal is to accurately infer the functions of putative genes or Open Reading Frames (ORFs) from existing databases using computational methods. However, this task is intrinsically difficult because the underlying biological processes involve complex interactions among multiple entities. Consequently, many functional links are missed when only one or two sources of data are used in the prediction. Our hypothesis is that integrating evidence from multiple complementary sources could significantly improve prediction accuracy. Our experimental results not only suggest that this hypothesis is valid, but also provide guidelines for using the WNB system for data collection, training and prediction. The combined training data sets contain information from gene annotations, gene expression, clustering outputs, keyword annotations, and sequence homology from public databases. The current system is trained and tested on the genes of the budding yeast Saccharomyces cerevisiae. Through the weight training process, our WNB model can also be used to analyze the contribution of each source of information to the prediction performance. This contribution analysis could potentially lead to significant scientific discovery by facilitating the interpretation and understanding of the complex relationships between biological entities.
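The abstract above does not spell out how the WNB weights enter the model. As a rough illustration only, the sketch below uses one common weighted naive Bayes formulation, in which each evidence source contributes its log-likelihood ratio scaled by a per-source weight. The function names, the source names ("expression", "homology"), and the numbers are hypothetical and are not taken from the paper.

```python
import math


def weighted_nb_log_odds(prior, likelihood_ratios, weights):
    """Combine evidence sources into log-odds for one candidate function.

    Assumed formulation (not confirmed by the abstract):
        log_odds = log(prior / (1 - prior)) + sum_s w_s * log(LR_s)
    where LR_s = P(evidence_s | gene has function) / P(evidence_s | it does not).
    """
    log_odds = math.log(prior / (1.0 - prior))
    for source, ratio in likelihood_ratios.items():
        # A weight of 0 ignores the source; a weight of 1 recovers plain naive Bayes.
        log_odds += weights[source] * math.log(ratio)
    return log_odds


def posterior(log_odds):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical evidence for one gene and one candidate function.
ratios = {"expression": 3.0, "homology": 2.0}
equal_weights = {"expression": 1.0, "homology": 1.0}
no_homology = {"expression": 1.0, "homology": 0.0}

p_all = posterior(weighted_nb_log_odds(0.5, ratios, equal_weights))
p_expr_only = posterior(weighted_nb_log_odds(0.5, ratios, no_homology))
```

Under this assumed formulation, comparing the trained weights across sources is one way to read off how much each source contributes to the prediction, which is the kind of contribution analysis the abstract describes.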
We present an algorithm for predicting transcription factor binding sites based on ChIP-chip and phylogenetic footprinting data. Our algorithm is robust against low promoter sequence similarity and motif rearrangements, because it does not depend on multiple sequence alignments. This, in turn, allows us to incorporate information from more distant species. Representative random data sets are used to estimate the significance of the scores. Our algorithm is fully automatic and does not require human intervention. On a recent S. cerevisiae data set, it achieves higher accuracy than the previously best algorithms. The adaptive ChIP-chip threshold and the modular positional bias score are two general features of our algorithm that increase motif prediction accuracy and could be implemented in other algorithms as well. In addition, since our algorithm works partly orthogonally to other algorithms, combining several algorithms can increase prediction accuracy even further. Specifically, our method finds six motifs not found by the second-best algorithm.
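The abstract above states that random data sets are used to estimate score significance but gives no details. A generic way to do this is an empirical p-value: score many random data sets and count how often they score at least as high as the real one. The sketch below shows that generic technique only. The names `empirical_p_value`, `score_fn`, and `make_random_dataset` are hypothetical placeholders, and the toy scoring function stands in for whatever motif score the authors actually use.

```python
import random


def empirical_p_value(observed_score, score_fn, make_random_dataset,
                      n_random=1000, seed=0):
    """Estimate how often random data scores at least as high as the observation.

    score_fn and make_random_dataset are caller-supplied placeholders; the
    abstract does not say how the random sets are built or scored.
    """
    rng = random.Random(seed)
    null_scores = [score_fn(make_random_dataset(rng)) for _ in range(n_random)]
    n_at_least = sum(1 for s in null_scores if s >= observed_score)
    # Add-one correction so the estimate is never exactly zero.
    return (n_at_least + 1) / (n_random + 1)


# Toy usage: the "data set" is a single uniform draw in [0, 1) and the score is the draw.
p_high = empirical_p_value(2.0, lambda x: x, lambda rng: rng.random())
p_low = empirical_p_value(-1.0, lambda x: x, lambda rng: rng.random())
```

In the toy usage, a score of 2.0 can never be reached by the random draws, so the estimate is the smallest value the add-one correction allows, while a score of -1.0 is always exceeded and the estimate is 1.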
We review the proposed mathematical models of the response to osmotic stress in yeast. These models differ mainly in the choice of mathematical representation (e.g., Bayesian networks, ordinary differential equations, or rule-based models), the extent to which the modeling is data-driven, and predictability. The overview exemplifies how one biological system can be modeled with various modeling techniques and at different levels of resolution, and how the choice is typically based on the amount and quality of available data, prior information about the system, and the research question in focus. As a natural part of the overview, we discuss the requirements, advantages, and limitations of the different modeling approaches.