This volume contains papers presented at the 18th International Conference on Genome Informatics (GIW 2007) held at the Biopolis, Singapore from December 3 to 5, 2007. The GIW Series provides an international forum for the presentation and discussion of original research papers on all aspects of bioinformatics, computational biology and systems biology. Its scope includes biological sequence analysis, protein folding prediction, gene regulatory networks, clustering algorithms, comparative genomics, and text mining. Boasting a history of 18 years, GIW is likely the longest-running international bioinformatics conference.
A total of 16 papers were selected for presentation at GIW 2007 and inclusion in this book. The notable authors include Ming Li (University of Waterloo, Canada), Minoru Kanehisa (Kyoto University, Japan), Vladimir Kuznetsov (Genome Institute of Singapore), Tao Jiang (UC Riverside, USA), Christos Ouzounis (European Bioinformatics Institute, UK), and Satoru Miyano (University of Tokyo, Japan). In addition, this book contains abstracts from the five invited speakers: Frank Eisenhaber (Bioinformatics Institute, Singapore), Sir David Lane (Institute of Molecular and Cell Biology, Singapore), Hanah Margalit (The Hebrew University of Jerusalem, Israel), Lawrence Stanton (Genome Institute of Singapore), and Michael Zhang (Cold Spring Harbor Laboratory, USA).
Sample Chapter(s)
Chapter 1: Detection of Monosaccharide Types from Coordinates (615 KB)
https://doi.org/10.1142/9781860949852_fmatter
CONTENTS
PREFACE
ACKNOWLEDGMENTS
COMMITTEES
https://doi.org/10.1142/9781860949852_0001
Almost half of all biological molecules (proteins and metabolites) are estimated, by extrapolation, to be glycosylated within cells. Detecting glycosylation patterns and the types of attached sugars is therefore an important step in future glycomics research. We present two algorithms to detect sugar types in Haworth projection, i.e., from x-y coordinates. Applied to a flavonoid database, the algorithms identified backbone-specific biases of sugar types and their conjugated positions. The algorithms not only help bridge polysaccharide databases and pathway databases, but also help detect structural errors in metabolic databases.
https://doi.org/10.1142/9781860949852_0002
Super-secondary structure elements (super-SSEs) are the structurally conserved ensembles of secondary structure elements (SSEs) within a protein. They are of great biological interest. In this work, we present a method to formally represent and mine the sequence order independent super-SSE motifs that occur repeatedly in large data sets of protein structures. We represent a protein structure as a graph, and mine the common cliques from a set of protein graphs in order to find the motifs. We mine two categories of super-SSE motifs: the generic motifs that occur frequently across the entire database of protein structures, and the fold-preferential motifs that are concentrated in particular protein fold types. From the experimental data set of 600 proteins belonging to 15 large SCOP Folds, we have discovered 21 generic motifs and 75 fold-preferential motifs that are both statistically significant and biologically relevant. A number of the discovered motifs (both generic and fold-preferential) resemble the well-known super-SSE motifs in the literature such as beta hairpins, Greek keys, zinc fingers, etc. Some of the discovered motifs are of novel shapes that have not been documented yet. Our method is time-efficient: it can discover all the motifs across the 600 proteins in less than 14 minutes on a standalone PC. The discovered motifs are reported on our project webpage: http://www1.i2r.a-star.edu.sg/~azeyar/SuperSSE/
https://doi.org/10.1142/9781860949852_0003
Motivation.
Although protein structure prediction has made great progress in recent years, a protein model derived from automated prediction methods is subject to various errors. As methods for structure prediction develop, a continuing problem is how to evaluate the quality of a protein model, especially how to identify the well-predicted regions of the model, so that the structural biology community can benefit from automated structure prediction. It is also important to identify badly predicted regions in a model so that refinement measures can be applied to them.
Results.
We present a novel technique, FragQA, to accurately predict the local quality of a sequence-structure (i.e., sequence-template) alignment generated by comparative modeling (i.e., homology modeling and threading). Unlike previous local quality assessment methods, FragQA directly predicts the cRMSD between a continuously aligned fragment determined by an alignment and the corresponding fragment in the native structure. FragQA uses an SVM (Support Vector Machine) regression method to perform prediction using information extracted from a single given alignment. Experimental results demonstrate that FragQA performs well at predicting local quality. More specifically, FragQA has better prediction accuracy than a top performer, ProQres [18]. Our results indicate that (1) local quality can be predicted well; (2) local sequence evolutionary information (i.e., sequence similarity) is the major factor in predicting local quality; and (3) structure information such as solvent accessibility and secondary structure helps improve prediction performance.
https://doi.org/10.1142/9781860949852_0004
We evaluate the performance of six amino acid indices in B cell epitope residue prediction using the classical sliding window method on five data sets. Four of the indices, i.e., relative connectivity, clustering coefficient, closeness and betweenness, are newly derived from the topological parameters of residue networks. The other two are Parker's hydrophilicity and Levitt's index, known as the best indices so far for B cell epitope prediction. On four of the data sets, the performance of all the indices was comparable and poor in general. When applied to one well-annotated data set, the performance improved and the four network-based indices outperformed Parker's hydrophilicity and Levitt's index. When using the relative connectivity index on this data set, the prediction accuracy, sensitivity and specificity reached 73.6%, 73.0% and 75.0% respectively, with an area under the curve of about 0.796. We therefore suggest that this index is a good choice for B cell epitope prediction. The results also indicate that the low performance of B cell epitope prediction is due not only to the methods and amino acid indices used, but also to the data sets. Interestingly, on the well-annotated data set, the performance of B cell epitope residue prediction is very similar to that of protein surface residue prediction, especially at the 10 and 20 Å² cutoffs. This suggests that the performance in surface residue prediction might form a theoretical upper limit for the performance of B cell epitope residue prediction methods.
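The classical sliding window method mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the window size of 7, the threshold, and the toy index values are all assumptions made for the example, and a real analysis would substitute a published index such as Parker's hydrophilicity.

```python
def sliding_window_scores(sequence, index, window=7):
    """Average a per-residue index over a centered window.

    Returns one score per residue; windows are truncated at the
    sequence ends rather than padded (an assumption of this sketch).
    """
    half = window // 2
    scores = []
    for i in range(len(sequence)):
        lo = max(0, i - half)
        hi = min(len(sequence), i + half + 1)
        values = [index[aa] for aa in sequence[lo:hi]]
        scores.append(sum(values) / len(values))
    return scores


def predict_epitope_residues(sequence, index, window=7, threshold=0.0):
    """Label a residue as a predicted epitope residue when its
    window-averaged score reaches the (hypothetical) threshold."""
    scores = sliding_window_scores(sequence, index, window)
    return [score >= threshold for score in scores]


# Toy index values, invented purely for illustration.
TOY_INDEX = {"A": -0.5, "D": 1.0, "K": 1.0, "L": -1.0}

if __name__ == "__main__":
    print(predict_epitope_residues("LDKKALLA", TOY_INDEX, window=3, threshold=0.5))
```

Sweeping the threshold over the score range and comparing the labels against annotated epitope residues would yield the sensitivity, specificity and area-under-the-curve figures of the kind reported in the abstract.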
https://doi.org/10.1142/9781860949852_0005
In the evolution of the eukaryotic genome, exon or domain shuffling has produced a variety of proteins. On the assumption that each fusion event between two independent protein domains occurred only once in the evolution of metazoans, we can roughly estimate when the fusion events occurred. For this purpose, we made phylogenetic profiles of pair-wise domain-combinations of metazoans. The phylogenetic profiles can be expected to reflect the protein evolution of metazoans. Interestingly, the phylogenetic tree of metazoans derived from the profiles supported the “Ecdysozoa hypothesis”, which is one of the major hypotheses for metazoan evolution. Further, the phylogenetic profiles revealed candidate genes that were required for clade-specific features in metazoan evolution. We propose that comparative proteome analysis focusing on pair-wise domain-combinations is a useful strategy for researching metazoan evolution. Additionally, we found that the extant ecdysozoans share only fourteen domain-combinations in our profiles. Such a small number of ecdysozoan-specific domain-combinations is consistent with extensive gene loss during the evolution of ecdysozoans.
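A phylogenetic profile of pair-wise domain-combinations can be pictured as a presence/absence table with one row per domain pair and one column per species. The sketch below is only an illustration under assumed inputs: the function names are hypothetical, each proteome is assumed to be given as a list of per-protein domain lists, and the toy species and domain assignments are invented.

```python
from itertools import combinations


def domain_pairs(proteins):
    """Collect unordered pairs of distinct domains that co-occur in one protein."""
    pairs = set()
    for domains in proteins:
        for a, b in combinations(sorted(set(domains)), 2):
            pairs.add((a, b))
    return pairs


def phylogenetic_profile(proteomes):
    """Map each domain pair to a 0/1 presence vector over the sorted species list."""
    species = sorted(proteomes)
    per_species = {sp: domain_pairs(proteomes[sp]) for sp in species}
    all_pairs = sorted(set().union(*per_species.values()))
    profile = {
        pair: [int(pair in per_species[sp]) for sp in species]
        for pair in all_pairs
    }
    return species, profile


# Toy proteomes, invented for illustration only.
TOY_PROTEOMES = {
    "fly": [["SH2", "SH3"], ["Kinase"]],
    "worm": [["SH3", "SH2", "Kinase"]],
    "human": [["SH2", "Kinase"]],
}

if __name__ == "__main__":
    species, profile = phylogenetic_profile(TOY_PROTEOMES)
    print(species)
    for pair, row in profile.items():
        print(pair, row)
```

Under the single-origin assumption stated in the abstract, a pair present only in the species of one clade points to a fusion event on the branch leading to that clade, and the rows taken together can serve as characters for tree building.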
https://doi.org/10.1142/9781860949852_0006
We suggest a novel parametric approach to estimating the significance of the output of motif finders. Specifically, we rely on the good fit we observe between the 3-parameter Gamma family and the null distribution of motif scores. This fit was observed across multiple motif finders, background models and scoring functions. Under this parametric assumption we compute, and show the utility of, a conservative confidence interval for the p-value of the observed score. Since our method relies on the 3-parameter Gamma fit, it should be applicable to a variety of finders.
https://doi.org/10.1142/9781860949852_0007
A polyadenine tail is found at the 3' end of nearly every fully processed eukaryotic mRNA and has been suggested to influence virtually all aspects of mRNA metabolism. The ability to predict polyadenylation sites will allow us to define gene boundaries, predict the number of genes present in a particular gene locus and perhaps better understand mRNA metabolism. To this end, we built an Arabidopsis polyadenylation prediction model. The prediction model uses a machine learning method which consists of four sequential steps: feature generation, feature selection, feature integration and cascade classification. We have tested our model on public datasets and achieved more than 97% sensitivity and specificity. We have also directly compared it with another Arabidopsis prediction model, PASS 1.0, and have achieved better results.
https://doi.org/10.1142/9781860949852_0008
Advances in high-throughput technologies, such as ChIP-chip and ChIP-PET (Chromatin Immuno-Precipitation Paired-End diTag), and the availability of human and mouse genome sequences now allow us to identify transcription factor binding sites (TFBS) and analyze mechanisms of gene regulation on the level of the entire genome. Here, we have developed a computational approach which uses ChIP-PET data and statistical modeling to assess experimental noise and identify reliable TFBS for c-Myc, STAT1 and p53 transcription factors in the human genome. We propose a mixture probabilistic model and develop computational programs for Monte Carlo simulation of ChIP-PET data to define the background noise of the sequence clustering and to identify the probability function of specific DNA-protein binding in the eukaryotic genome. Our approach demonstrates high reproducibility of the method and not only distinguishes bona fide TFBSs from non-specific TFBSs with a high specificity, but also provides algorithmic and computational basis for further optimization of experimental parameters of the ChIP-PET method.
https://doi.org/10.1142/9781860949852_0009
With the launch of the international HapMap project, the haplotype inference problem has recently attracted a great deal of attention in the computational biology community. In this paper, we study the question of how to efficiently infer haplotypes from genotypes of individuals related by a pedigree without mating loops, assuming that the hereditary process was free of mutations (i.e. the Mendelian law of inheritance) and recombinants. We model the haplotype inference problem as a system of linear equations as in [10] and present an (optimal) linear-time (i.e. O(mn) time) algorithm to generate a particular solution to the haplotype inference problem, where m is the number of loci (or markers) in a genotype and n is the number of individuals in the pedigree. Moreover, the algorithm also provides a general solution in O(mn^2) time, which is optimal because the size of a general solution could be as large as Θ(mn^2). The key ingredients of our construction are (i) a fast consistency checking procedure for the system of linear equations introduced in [10], based on a careful investigation of the relationship between the equations, and (ii) a novel linear-time method for solving linear equations without invoking the Gaussian elimination method. Although such a fast method for solving equations is not known for general systems of linear equations, we take advantage of the underlying loop-free pedigree graph and some special properties of the linear equations.
https://doi.org/10.1142/9781860949852_0010
Prediction of protein functional sites from 3D structure is an important problem, particularly as structural genomics projects produce hundreds of structures of unknown function, including novel folds and the structures of orphan sequences. The present paper shows how computed protonation properties provide unique and powerful capabilities for the prediction of catalytic sites from the 3D structure alone. These protonation properties of the ionizable residues in a protein may be computed from the 3D structure using the calculated electrical potential function. In particular, the shapes of the theoretical microscopic titration curves (THEMATICS) enable selection of the residues involved in catalysis or small molecule recognition with good sensitivity and precision. Results are shown for 169 annotated enzymes in the Catalytic Site Atlas (CSA). Performance, as measured by residue recall and precision, is clearly better than that of other 3D-structure-based methods. When compared with methods based on sequence alignments and structural comparisons, THEMATICS performance is competitive for well-characterized enzymes. However THEMATICS performance does not degrade in the absence of similarity, as do the alignment-based methods, even if there are few or no sequence homologues or few or no proteins of similar structure. It is further shown that the protonation properties perform well on open, unbound structures, even if there is substantial conformational change upon ligand binding.
https://doi.org/10.1142/9781860949852_0011
Peptide identification by tandem mass spectrometry (MS/MS) is one of the most important problems in proteomics. Recent advances in high-throughput MS/MS experiments produce huge numbers of spectra. Unfortunately, identification of these spectra is relatively slow, and the accuracy of current algorithms is not high in the presence of noise and post-translational modifications (PTMs). In this paper, we strive to achieve high accuracy and efficiency for the peptide identification problem, with special attention to the identification of peptides with PTMs. This paper expands our previous work on PepSOM with the introduction of two accurate modified scoring functions: Sλ for peptide identification and Sλ* for identification of peptides with PTMs. Experiments showed that our algorithm is both fast and accurate for peptide identification. Experiments on spectra with simulated and real PTMs confirmed that our algorithm is accurate for identifying PTMs.
https://doi.org/10.1142/9781860949852_0012
The detection of gene fusion events across genomes can be used for the prediction of functional associations of proteins, including physical interactions or complex formation. These predictions are obtained by the detection of similarity for pairs of ‘component’ proteins to ‘composite’ proteins. Since the number of composite proteins is limited in nature, we augment this set by creating artificial fusion proteins from experimentally determined protein interacting pairs. The goal is to study the extent of protein interaction partners with increasing phylogenetic distance, using an automated method. We have thus detected component pairs within seven entire genome sequences of similar size, using artificially generated composite proteins that have been shown to interact experimentally. Our results indicate that protein interactions are not conserved over large phylogenetic distances. In addition, we provide a set of predictions for functionally associated proteins across seven species using experimental information and demonstrate the applicability of fusion analysis for the comparative genomics of protein interactions.
https://doi.org/10.1142/9781860949852_0013
We propose a statistical method based on graphical Gaussian models for estimating large gene networks from DNA microarray data. In estimating large gene networks, the number of genes is larger than the number of samples, so we need to consider some restrictions for model building. We propose weighted lasso estimation for the graphical Gaussian models as a model of large gene networks. In the proposed method, the structural learning for gene networks is equivalent to the selection of the regularization parameters included in the weighted lasso estimation. We investigate this problem from a Bayesian approach and derive an empirical Bayesian information criterion for choosing them. Unlike the Bayesian network approach, our method can find the optimal network structure and does not require a heuristic structural learning algorithm. We conduct Monte Carlo simulations to show the effectiveness of the proposed method. We also analyze Arabidopsis thaliana microarray data and estimate gene networks.
https://doi.org/10.1142/9781860949852_0014
We present a new method to describe tissue-specific function that takes advantage of Cap Analysis of Gene Expression (CAGE) data. CAGE expression data represent the number of mRNAs of each gene in a sample. This feature enables us to compare or add the expression amounts of genes within the sample. Because conventional methods compare expression values among tissues one gene at a time and rule out comparisons among genes, they have not exploited this feature to reveal tissue specificity. To utilize the feature, we used Gene Ontology terms (GO-terms) as units over which to sum the expression values, and described the specificities of tissues by them. We regard GO-terms as events that occur in the tissue according to probabilities that are defined by means of the CAGE data. Our method is applied to mouse CAGE data on 22 tissues. Among them, we show the results for molecular functions and cellular components in liver. We also show the most highly expressed genes in liver for comparison with our method. The results agree well with well-known specific functions of liver such as amino acid metabolism. Moreover, differences in inter-cellular junctions among liver, lung, heart, muscle and prostate gland are clearly observed. The results of our method give researchers clues for further research into tissue roles and the deeper functions of tissue-specific genes. All the results and supplementary materials are available via our web site.
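The idea of summing expression values over GO-terms and reading the result as a probability can be sketched as below. This is one plausible reading rather than the paper's definition, which the abstract does not give: the normalization by the total tag count, the function name, and the gene and term identifiers are all assumptions of this example.

```python
def go_term_probabilities(tag_counts, gene_to_terms):
    """Sum per-gene tag counts over each GO-term and divide by the total count.

    tag_counts:    gene -> number of CAGE tags observed in one tissue
    gene_to_terms: gene -> iterable of GO-term identifiers
    Because a gene may carry several terms, the returned values need not
    sum to 1 (an artifact of this simplified normalization).
    """
    total = sum(tag_counts.values())
    if total == 0:
        return {}
    sums = {}
    for gene, count in tag_counts.items():
        for term in gene_to_terms.get(gene, ()):
            sums[term] = sums.get(term, 0) + count
    return {term: value / total for term, value in sums.items()}


# Toy counts and annotations, invented for illustration only.
TOY_COUNTS = {"g1": 60, "g2": 30, "g3": 10}
TOY_ANNOTATION = {"g1": ["termA"], "g2": ["termA", "termB"], "g3": ["termB"]}

if __name__ == "__main__":
    print(go_term_probabilities(TOY_COUNTS, TOY_ANNOTATION))
```

Computing such a table per tissue and comparing the values for the same term across tissues is one way to surface tissue-specific functions of the kind the abstract describes for liver.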
https://doi.org/10.1142/9781860949852_0015
Identifying lethal proteins is important for understanding the intricate mechanism governing life. Researchers have shown that the lethality of a protein can be computed based on its topological position in the protein-protein interaction (PPI) network. Performance of current approaches has been less than satisfactory as the lethality of a protein is a functional characteristic that cannot be determined solely by network topology. Furthermore, a significant number of lethal proteins have low connectivity in the interaction networks but are overlooked by most current methods.
Our work reveals that a protein's lethality correlates more strongly with its “functional centrality” than with pure topological centrality. We define functional centrality as the topological centrality within a subnetwork of proteins with similar functions. Evaluation experiments on four Saccharomyces cerevisiae PPI datasets showed that our method, NFC, performed significantly better than all the other existing computational techniques. Our method was able to detect low-connectivity lethal proteins that were previously undetected by conventional methods. The results and an online version of NFC are available at http://lethalproteins.i2r.a-star.edu.sg
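To make the notion of functional centrality concrete, the sketch below uses the simplest possible centrality, degree, and counts only those interaction partners that share at least one function label with the protein. This is an assumed simplification for illustration and should not be read as the NFC measure itself; the function name, the shared-label criterion, and the toy network are hypothetical.

```python
def functional_degree(edges, functions):
    """Count, for each protein, the neighbors sharing at least one function label.

    edges:     iterable of (protein_a, protein_b) interactions
    functions: protein -> set of function labels
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    scores = {}
    for protein, neighbors in adjacency.items():
        own = functions.get(protein, set())
        scores[protein] = sum(
            1 for other in neighbors if own & functions.get(other, set())
        )
    return scores


# Toy network and labels, invented for illustration only.
TOY_EDGES = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
TOY_FUNCTIONS = {
    "P1": {"repair"},
    "P2": {"repair"},
    "P3": {"transport"},
    "P4": {"transport"},
}

if __name__ == "__main__":
    print(functional_degree(TOY_EDGES, TOY_FUNCTIONS))
```

In the toy network, P3 has the highest plain degree (three partners) while P4 has only one, yet both receive the same functional degree, which illustrates how a low-connectivity protein can rank alongside a hub once the network is restricted to functionally similar partners.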
https://doi.org/10.1142/9781860949852_0016
In silico approaches to the identification of bacterial promoters are hampered by poor conservation of their characteristic binding sites. This suggests that the usual position weight matrix models of bacterial promoters are incomplete. A number of methods have been used to overcome this inadequacy, one of which is to incorporate structural properties of DNA. In this paper we describe an extension of the promoter description to include SIDD (stress induced duplex destabilization), DNA curvature and stacking energy. Although we report the best result to date for a realistic promoter prediction task, surprisingly, DNA structural properties did not contribute significantly to this result. We also demonstrate, for the first time, that sigma-54 promoters have a stronger association with SIDD than do other promoter types.
https://doi.org/10.1142/9781860949852_0017
Even when it is acknowledged that biomedical sciences are still essentially experimental and lack a predictive theory in most subfields, it is all the more important to underline the few niches where theoretical/computational approaches add creatively to biological insight. Protein sequence analysis can predict aspects of molecular and cellular function in many cases and, in this way, decisively direct follow-up experiments for the characterization of as yet uncharacterized genes and the discovery of new cellular pathways.
Therefore, the analysis of gene/protein sequences is advised to become an integral part of any molecular and cellular biological research, best in the early and planning phase since this allows avoiding unnecessary experiments. At the same time, typical mutational, expression profiling or interaction screens generate dozens or hundreds of protein targets that might require in-depth sequence analysis that, for example with available WWW-tools, will take days for a single target. The ANNOTATOR software suite provides the environment to carry out all routine steps for protein sequence analysis automatically and to enable the researcher to focus her/his time on thinking over the results. The ANNOTATOR has ca. 40 academic tools for protein sequence studies and all major databases built-in together with a number of sophisticated workflows that have shown their potential in previous discoveries.
The talk will give an insight into the biological and software design concepts of the ANNOTATOR. The applications discussed include the discoveries of SET domain, ATGL and Ecol functions, the prediction of various posttranslational modifications from protein sequence, as well as the extension of the ANNOTATOR for protein mass-spectrometry data analysis tasks.
https://doi.org/10.1142/9781860949852_0018
Somatic mutations in the p53 gene occur in half of all human cancers, and germ line mutations in p53 are responsible for the family cancer predisposition known as Li-Fraumeni Syndrome. In those cancers that retain the normal p53 gene, other components of the p53 pathway are often damaged. Recently two mouse models have suggested that p53 activity may also affect aging. The p53 response is induced by a wide variety of different stress signals and when activated can induce cell cycle arrest, cell senescence or cell death. Many currently used cancer treatments activate the p53 response through a DNA damage dependent pathway, and p53 gene therapy has recently gained clinical approval in China. In mice and men the threshold of the response is very finely balanced and controlled by a number of regulatory proteins. Of particular interest is the Mdm2 protein, a ubiquitin E3 ligase that binds to p53 and targets it for degradation. A recently discovered polymorphism in the Mdm2 promoter may affect the age of onset of cancer in man. Drugs that target the Mdm2 pathway can act as non-genotoxic activators of the p53 response and one of these is currently in clinical trial in Singapore.
Understanding in detail how the p53 response is regulated may allow the pharmaceutical manipulation of the pathway. We have very recently discovered that the p53 gene has a more complex structure than has been appreciated for the last twenty years and several new iso-forms of p53 have been characterized potentially yielding new sources of individual variation and new targets for therapy.
https://doi.org/10.1142/9781860949852_0019
Small non-coding RNAs have recently gained much interest, as it has become evident that they are widespread in both pro- and eukaryotes and play important roles in post-transcriptional regulation of gene expression. These molecules present intriguing computational challenges: How can small RNA-encoding genes be identified based on the genome sequence? How many such genes are present in a genome? How can their gene targets be identified? What are the properties of regulation by small RNAs in comparison to other types of regulation, such as transcriptional regulation and protein-protein interaction? How is post-transcriptional regulation by small RNAs integrated with transcriptional regulation in the cellular networks? In my talk I will touch upon these questions and describe our attempts to address them. Intriguingly, viruses also encode regulatory RNAs, some of which play a role in cross-talk with the host. By a combination of computational and experimental approaches we identified human targets of viral microRNAs and showed that viruses use microRNAs for evasion of the host immune system. In my talk I will elaborate on these and other interesting human targets of viral microRNAs. Our results have promising therapeutic applications both for immunosuppressive therapy, by mimicking the role of the viral microRNAs, and for anti-viral therapy, by using anti-sense molecules against them.
https://doi.org/10.1142/9781860949852_0020
REST (RE1 silencing transcription factor) is a protein that regulates neuronal gene expression. REST binds to a highly conserved 21-bp RE1 element and recruits corepressors to repress transcription. Recently REST was found to be expressed in embryonic stem cells (ESC) and was identified as a direct target of Nanog and Oct4, two transcription factors critical in maintaining the pluripotency and self-renewal of ESC. We are interested in understanding the role that REST plays in ESC. We have identified hundreds of target genes directly regulated by REST in ESC by performing comprehensive chromatin immunoprecipitation (ChIP)-on-chip experiments. A computational approach was used to identify ~900 RE1 elements within the mouse genome. We then constructed an oligonucleotide array that contained unique probes for all these RE1 sites for our ChIP-on-chip experiments. Our results showed that REST binds to > 500 RE1 elements. We are now assessing REST occupancy by a comprehensive sequencing-based ChIP method (ChIP-PET) for a more unbiased search for REST targets. Using these two different methods, we are able to identify REST targets at the genome-wide level, which will provide us with a clearer picture of the role of REST in the regulatory network that controls ESC differentiation.
https://doi.org/10.1142/9781860949852_0021
Identification of direct targets of an individual or a combination of transcription factors (TFs) is central to determination of regulatory network architecture. Experimental approaches require a combination of expression profiling and binding assay to accurately identify direct targets.
Here we propose an adaptive determination of the gene activation thresholds by using regression splines. Since the thresholds are learnt adaptively from the expression data, the identified targets depend on the physiological condition under which the mRNA sample was obtained. It can work with data from a single condition and no separation of genes into foreground and background sets is necessary. Using human cell-cycle as an example, we show that the E2F targets that we identify at the G1/S phase are significantly different from those at the G2/M phase. We verify known targets and find several novel direct targets of E2F in the G2/M phase.
https://doi.org/10.1142/9781860949852_bmatter
AUTHOR INDEX