The Pacific Symposium on Biocomputing brings together key researchers from the international biocomputing community. It is designed to be maximally responsive to the need for critical mass in subdisciplines within biocomputing. This book contains peer-reviewed articles in computational biology.
https://doi.org/10.1142/9789812799623_fmatter
https://doi.org/10.1142/9789812799623_0001
With the completion of a rough draft of the human genome sequence in sight, researchers are shifting to leverage this new information to elucidate the genetic basis of disease susceptibility and drug response. Massive genotyping and gene expression profiling studies are being planned and carried out by academic and public institutions as well as industry. Researchers from many disciplines are interested in mining the data from these studies: human geneticists, population geneticists, molecular biologists, computational biologists, and even clinical practitioners. These communities have different immediate goals, but ultimately they seek the same thing: the connection between variation in a group of genes, or in their expression, and observed phenotypes. There is a pressing need to link information across the huge data sets these groups are producing independently. However, there are tremendous challenges in the integration of polymorphism and gene expression databases and their clinical phenotypic annotation…
https://doi.org/10.1142/9789812799623_0002
We present a method for visually and quantitatively assessing the presence of structure in clustered data. The method exploits measurements of the stability of clustering solutions obtained by perturbing the data set. Stability is characterized by the distribution of pairwise similarities between clusterings obtained from subsamples of the data. High pairwise similarities indicate a stable clustering pattern. The method can be used with any clustering algorithm; it provides a means of rationally defining an optimum number of clusters, and can also detect the lack of structure in data. We show results on artificial and microarray data using a hierarchical clustering algorithm.
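A minimal sketch of the idea (not the authors' code): perturb the data by subsampling, cluster each subsample hierarchically, and collect pairwise similarities between clusterings on their shared points. The subsampling fraction and the use of the adjusted Rand index as the similarity measure are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

def cluster_labels(data, k):
    """Average-linkage hierarchical clustering into k clusters."""
    return fcluster(linkage(data, method="average"), k, criterion="maxclust")

def stability(data, k, n_pairs=50, frac=0.8, seed=0):
    """Distribution of pairwise similarities between clusterings of subsamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    sims = []
    for _ in range(n_pairs):
        a = rng.choice(n, int(frac * n), replace=False)
        b = rng.choice(n, int(frac * n), replace=False)
        la, lb = cluster_labels(data[a], k), cluster_labels(data[b], k)
        pos_a = {idx: i for i, idx in enumerate(a)}
        pos_b = {idx: i for i, idx in enumerate(b)}
        shared = np.intersect1d(a, b)  # compare only points in both subsamples
        sims.append(adjusted_rand_score([la[pos_a[i]] for i in shared],
                                        [lb[pos_b[i]] for i in shared]))
    return np.array(sims)  # a tight, high distribution suggests a stable k
```

A tight distribution near 1 for some k, degrading as k grows, is the visual signature of structure; a broad, low distribution at every k signals its absence.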
https://doi.org/10.1142/9789812799623_0003
An important problem in the analysis of microarray data is correlating the high-dimensional measurements with clinical phenotypes. In this paper, we develop predictive models for associating gene expression data from microarray experiments with such outcomes. They are based on the singular value decomposition. We propose new algorithms for performing gene selection and gene clustering based on these predictive models. The estimation procedure using the regression models occurs in two stages. First, the gene expression measurements are transformed using the singular value decomposition. The regression parameters in the model linking the principal components with the clinical responses are then estimated using maximum likelihood. We demonstrate the application of the methodology to data from a breast cancer study.
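The two-stage estimation can be sketched compactly. Below is an illustrative version assuming a continuous clinical response fit by ordinary least squares; the paper's maximum-likelihood estimation under its specific regression models may differ in detail.

```python
import numpy as np

def svd_regression(X, y, n_components=5):
    """Stage 1: SVD of the expression matrix; Stage 2: regress outcome on PCs."""
    Xc = X - X.mean(axis=0)                      # X: samples x genes, gene-centered
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = U[:, :n_components] * s[:n_components]   # principal components
    A = np.column_stack([np.ones(len(y)), Z])    # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, Vt[:n_components]               # coefficients and per-gene loadings
```

The loadings in `Vt` tie the fitted components back to individual genes, which is what makes gene selection and gene clustering possible on top of the regression.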
https://doi.org/10.1142/9789812799623_0004
Celera Genomics has constructed an automated computer system to support ultra-high-throughput SNP genotyping that satisfies the increasing demand that disease association studies are placing on current genotyping facilities. This system consists of the seamless integration of target SNP selection, automated oligo design, in silico assay quality validation, laboratory management of samples, reagents and plates, automated allele calling, optional manual review of autocalls, regular status reports, and linkage disequilibrium analysis. Celera has proven the system by generating over 2.5 million genotypes from more than 10,000 SNPs, and is approaching the target capacity of over 10,000 genotypes per machine per hour with limited human intervention, using state-of-the-art laboratory hardware.
https://doi.org/10.1142/9789812799623_0005
Genomic information is becoming increasingly useful for studying the origins of disease. Recent studies have focused on discovering new genetic loci and the influence of these loci upon disease. However, it is equally desirable to go in the opposite direction – that is, to infer genotype from the clinical phenotype for increased efficiency of treatment. This paper proposes a methodology for such inference. Our method constructs a simple knowledge-based model without the need of a domain expert and is useful in situations that have very little data and/or no training data. The model relates a disease's symptoms to particular clinical states of the disease. Clinical information is processed using the model, where appropriate weighting of the symptoms is learned from observed diagnoses to subsequently identify the state of the disease presented in hospital visits. This approach applies to any simple genetic disorder that has defined clinical phenotypes. We demonstrate the use of our methods by inferring age of onset and DNA mutations for Huntington's disease patients.
https://doi.org/10.1142/9789812799623_0006
The identification and characterization of susceptibility genes for common complex multifactorial human diseases remains a statistical and computational challenge. Parametric statistical methods such as logistic regression are limited in their ability to identify genes whose effects are dependent solely or partially on interactions with other genes and environmental exposures. We introduce cellular automata (CA) as a novel computational approach for identifying combinations of single-nucleotide polymorphisms (SNPs) associated with clinical endpoints. This alternative approach is nonparametric (i.e. no hypothesis about the value of a statistical parameter is made), is model-free (i.e. assumes no particular inheritance model), and is directly applicable to case-control and discordant sib-pair study designs. We demonstrate using simulated data that the approach has good power for identifying high-order nonlinear interactions (i.e. epistasis) among four SNPs in the absence of independent main effects.
https://doi.org/10.1142/9789812799623_0007
Research directed toward discovering how genetic factors influence a patient's response to drugs requires coordination of data produced from laboratory experiments, computational methods, and clinical studies. A public repository of pharmacogenetic data should accelerate progress in the field of pharmacogenetics by organizing and disseminating public datasets. We are developing a pharmacogenetics knowledge base (PharmGKB) to support the storage and retrieval of both experimental data and conceptual knowledge. PharmGKB is an Internet-based resource that integrates complex biological, pharmacological, and clinical data in such a way that researchers can submit their data and users can retrieve information to investigate genotype–phenotype correlations. Successful management of the names, meaning, and organization of concepts used within the system is crucial. We have selected a frame-based knowledge-representation system for development of an ontology of concepts and relationships that represent the domain and that permit storage of experimental data. Preliminary experience shows that the ontology we have developed for gene-sequence data allows us to accept, store, and query data submissions.
https://doi.org/10.1142/9789812799623_0008
The self-organizing feature map (SOFM or SOM) neural network approach has been applied to a number of life sciences problems. In this paper, we apply SOFMs to predicting the resistance of HIV to Saquinavir, an approved protease inhibitor. We show that a SOFM predicts resistance to Saquinavir with reasonable success based solely on the amino acid sequence of the mutated HIV protease. The best single network provided 69% coverage and 68% accuracy. We then combine a number of networks into various majority voting schemes. All of the combinations showed improved performance over the best single network, with an average of 85% coverage and 78% accuracy. Future research objectives are suggested based on these results.
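A minimal sketch of the majority-voting combination, with abstention when no class wins a strict majority; treating abstentions as uncovered cases is one plausible reading of the coverage/accuracy split reported above.

```python
from collections import Counter

def majority_vote(predictions, threshold=0.5):
    """predictions: labels ('resistant'/'susceptible') from the individual SOMs."""
    label, count = Counter(predictions).most_common(1)[0]
    if count / len(predictions) > threshold:
        return label
    return None  # abstain: reduces coverage rather than accuracy
```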
https://doi.org/10.1142/9789812799623_0009
Ontologies are useful for organizing large numbers of concepts having complex relationships, such as the breadth of genetic and clinical knowledge in pharmacogenomics. But because ontologies change and knowledge evolves, it is time-consuming to maintain stable mappings to external data sources that are in relational format. We propose a method for interfacing ontology models with data acquisition from external relational data sources. This method uses a declarative interface between the ontology and the data source, and this interface is modeled in the ontology and implemented using XML schema. Data is imported from the relational source into the ontology using XML, and data integrity is checked by validating the XML submission with an XML schema. We have implemented this approach in PharmGKB (http://www.pharmgkb.org/), a pharmacogenetics knowledge base. Our goals were to (1) import genetic sequence data, collected in relational format, into the pharmacogenetics ontology, and (2) automate the process of updating the links between the ontology and data acquisition when the ontology changes. We tested our approach by linking PharmGKB with data acquisition from a relational model of genetic sequence information. The ontology subsequently evolved, and we were able to rapidly update our interface with the external data and continue acquiring the data. Similar approaches may be helpful for integrating other heterogeneous information sources in order to make the diversity of pharmacogenetics data amenable to computational analysis.
https://doi.org/10.1142/9789812799623_0010
Linkage disequilibrium mapping is an important tool in disease gene mapping. Recently, Toivonen et al. [1] introduced a haplotype mining (HPM) method that is applicable to data consisting of unrelated high-risk and normal haplotypes. The HPM method orders haplotypes by their strength of association with trait values, and uses all haplotypes exceeding a given threshold of strength of association to predict the gene location. In this study, we extend the HPM method to pedigree data by measuring the strength of association between a haplotype and quantitative traits of interest using the Quantitative Pedigree Disequilibrium Test proposed by Zhang et al. [2]. This family-based HPM (F-HPM) method can incorporate haplotype information across a set of markers and allow both missing marker data and ambiguous haplotype information. We use a simulation procedure to evaluate the statistical significance of the patterns identified from the F-HPM method. When the F-HPM method is applied to analyze the sequence data from the seven candidate genes in the simulated data sets in the 12th Genetic Analysis Workshop, the association between genes and traits can be detected with high power, and the estimated locations of the trait loci are close to the true sites.
https://doi.org/10.1142/9789812799623_0011
One of the key developments in biology during the past decade has been the completion of the sequencing of a number of key organisms, ranging from organisms with short genomes, such as various bacteria, to important model organisms such as Drosophila melanogaster, and of course the human genome. As of February 2001, complete genome sequences were available for 4 eukaryotes, 9 archaea, 32 bacteria, over 600 viruses, and over 200 organelles (NCBI, 2001). The "draft" human sequence publicly available now spans over 90% of the genome, and finishing of the genome should be completed shortly. The large amount of available genomic sequence presents unprecedented opportunities for biological discovery and at the same time new challenges for the computational sciences. In particular, it has become apparent that comparative analyses of the genomes will play a central role in the development of computational methods for biological discovery. The principle of comparative analysis has long been a leitmotif in experimental biology, but the related computational challenge of large-scale comparative analysis leads to many interesting algorithmic challenges that remain at the forefront of computational biology research…
https://doi.org/10.1142/9789812799623_0012
The parameters by which alignments are scored can strongly affect the sensitivity and specificity of alignment procedures. While appropriate parameter choices are well understood for protein alignments, much less is known for genomic DNA sequences. We describe a straightforward approach to scoring nucleotide substitutions in genomic sequence alignments, especially human-mouse comparisons. Scores are obtained from relative frequencies of aligned nucleotides observed in alignments of non-coding, non-repetitive genomic regions, and can be theoretically motivated through substitution models. Additional accuracy can be attained by down-weighting alignments characterized by low compositional complexity. We also describe an evaluation protocol that is relevant when alignments are intended to identify all and only the orthologous positions. One particular scoring matrix, called HOXD70, has proven to be generally effective for human-mouse comparisons, and has been used by the PipMaker server since July 2000. We discuss but leave open the problem of effectively scoring regions of strongly biased nucleotide composition, such as low G+C content.
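The construction described, scores derived from relative frequencies of aligned nucleotides, is essentially a log-odds calculation. A minimal sketch, with an illustrative scale factor and rounding rather than the actual HOXD70 values:

```python
import math

def log_odds_scores(pair_counts, scale=100):
    """pair_counts[(a, b)]: count of aligned nucleotide pairs a:b in trusted alignments."""
    total = sum(pair_counts.values())
    p = {k: v / total for k, v in pair_counts.items()}                # joint frequencies
    q = {a: sum(p.get((a, b), 0.0) for b in "ACGT") for a in "ACGT"}  # background
    return {(a, b): round(scale * math.log(p[a, b] / (q[a] * q[b])))
            for (a, b) in p if p[a, b] > 0}
```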
https://doi.org/10.1142/9789812799623_0013
The field of comparative genomics allows us to elucidate the molecular mechanisms necessary for the machinery of an organism by contrasting its genome against those of other organisms. In this paper, we contrast the genome of Homo sapiens against those of C. elegans, Drosophila melanogaster, and S. cerevisiae to gain insight into what structural domains are present in each organism. Previous work has assessed this using sequence-based homology recognition systems such as Pfam [1] and Interpro [2]. Here, we pursue a structure-based assessment, analyzing genomes according to domains in the SCOP structural domain dictionary. Compared to other eukaryotic genomes, we observe additional domains in the human genome relating to signal transduction, immune response, transport, and certain enzymes. Compared to the metazoan genomes, the yeast genome shows an absence of domains relating to immune response, cell-cell interactions, and cell signaling.
https://doi.org/10.1142/9789812799623_0014
Comparative genome maps are a powerful tool for interpreting the genomes of related organisms. The species maps which are the input to the process of constructing comparative maps are often themselves constructed from incomplete or inconsistent data, resulting in markers (or genes) whose order is not fully resolved. This incomplete marker order information is often handled by placing markers whose relative order cannot be reliably inferred together in a bin which is mapped to a common location. Previous automated and manual methods have handled such markers in an ad hoc or arbitrary way. We present efficient algorithms for comparative map construction that provide a principled method for handling unresolved marker order. The algorithms are based on a technique for efficiently computing a marker order that optimizes a natural parsimony criterion; in this way, they also yield a working hypothesis about the original incomplete data set.
https://doi.org/10.1142/9789812799623_0015
This paper presents a new general approach for the spatial representation and visualization of DNA molecules and their annotated information. The approach is based on a biological 3D model that predicts the complex spatial trajectory of long naked DNA. Such modeling makes a global view of the sequence possible, one that is different from and complementary to other representations such as textual, linguistic, or syntactic ones. DNA is well known to be a three-dimensional structure, and this spatial information plays a large part in its evolution and in its interactions with other biological elements. This work is intended to motivate new bioinformatics studies of the spatial architecture of the genome. In addition, to obtain a friendly interactive visualization, a powerful graphic modeling scheme is proposed, including management of the complex DNA trajectory and annotation-based structuring of its content. The paper describes the spatial architecture modeling, with consideration of both biological and computational constraints. The work is implemented in a graphic software tool named ADN-Viewer, and several examples of visualization are shown for various organisms and biological elements.
https://doi.org/10.1142/9789812799623_0016
Pairwise stochastic context-free grammars ("Pair SCFGs") are powerful tools for finding conserved RNA structures, but unconstrained alignment to Pair SCFGs is prohibitively expensive. We develop versions of the Pair SCFG dynamic programming algorithms that can be conditioned on precomputed structures, significantly reducing the time complexity of alignment. We have implemented these algorithms for general Pair SCFGs in software that is freely available under the GNU General Public License.
https://doi.org/10.1142/9789812799623_0017
We propose a new method for constructing genetic networks from gene expression data using Bayesian networks. We use nonparametric regression to capture nonlinear relationships between genes and derive a new criterion for choosing the network in general situations. In a theoretical sense, our proposed theory and methodology subsume previous methods based on the Bayes approach. We applied the proposed method to the S. cerevisiae cell cycle data and showed the effectiveness of our method by comparison with previous methods.
https://doi.org/10.1142/9789812799623_0018
A new method was developed for revealing composite clusters of cis-elements in the promoters of eukaryotic genes that are functionally related or coexpressed. A software system, "ClusterScan", has been created that makes it possible: (i) to train the system on representative samples of promoters to reveal cis-elements that tend to cluster; (ii) to train the system on samples of functionally related promoters to identify functionally coupled transcription factors; and (iii) to search for these clusters in genomic sequences in order to identify and functionally characterize regulatory regions in the genome. A number of training samples from different functional and structural groups of promoters were analysed. A search for composite clusters in human chromosomes 21 and 22 revealed a number of interesting examples. Finally, a decision tree system was constructed to classify the promoters of several functionally related gene groups; it makes it possible to identify new promoters and computationally predict their possible function.
https://doi.org/10.1142/9789812799623_0019
Genomic sequencing typically generates a large collection of unordered contigs or scaffolds. Contig ordering (also known as gap closure) is a non-trivial algorithmic and experimental problem, since even relatively simple-to-assemble bacterial genomes typically result in a large set of contigs. Neighboring contigs may be separated either by gaps in read coverage or by repeats. In the latter case we say that the contigs are separated by pseudogaps, and we emphasize the important difference between gap closure and pseudogap closure. Existing gap closure approaches do not distinguish between gaps and pseudogaps and treat them in the same way. We describe a new fast strategy for closing pseudogaps (repeat resolution). Since in highly repetitive genomes the number of pseudogaps may exceed the number of gaps by an order of magnitude, this approach provides a significant advantage over existing gap closure methods.
https://doi.org/10.1142/9789812799623_0020
Whole-genome phylogenetic studies require various sources of phylogenetic signals to produce an accurate picture of the evolutionary history of a group of genomes. In particular, sequence-based reconstruction will play an important role, especially in resolving more recent events. But using sequences at the level of whole genomes means working with very large amounts of data—large numbers of sequences—as well as large phylogenetic distances, so that reconstruction methods must be both fast and robust as well as accurate. We study the accuracy, convergence rate, and speed of several fast reconstruction methods: neighbor-joining, Weighbor (a weighted version of neighbor-joining), greedy parsimony, and a new phylogenetic reconstruction method based on disk-covering and parsimony search (DCM-NJ+MP). Our study uses extensive simulations based on random birth-death trees, with controlled deviations from ultrametricity. We find that Weighbor, thanks to its sophisticated handling of probabilities, outperforms other methods for short sequences, while our new method is the best choice for sequence lengths above 100. For very large sequence lengths, all four methods have similar accuracy, so that the speed of neighbor-joining and greedy parsimony makes them the two methods of choice.
https://doi.org/10.1142/9789812799623_0021
Accurate splice site prediction is a critical component of any computational approach to gene prediction in higher organisms. Existing approaches generally use sequence-based models that capture local dependencies among nucleotides in a small window around the splice site. We present evidence that computationally predicted secondary structure of moderate length pre-mRNA subsequences contains information that can be exploited to improve acceptor splice site prediction beyond that possible with conventional sequence-based approaches. Both decision tree and support vector machine classifiers, using folding energy and structure metrics characterizing helix formation near the splice site, achieve a 5–10% reduction in error rate with a human data set. Based on our data, we hypothesize that acceptors preferentially exhibit short helices at the splice site.
https://doi.org/10.1142/9789812799623_0022
Recognition of regulatory sites in unaligned DNA sequences is an old and well-studied problem in computational molecular biology. Recently, large-scale expression studies and comparative genomics brought this problem into the spotlight by generating a large number of samples with unknown regulatory signals. Here we develop algorithms for recognition of signals in corrupted samples (where only a fraction of sequences contain sites) with biased nucleotide composition. We further benchmark these and other algorithms on several bacterial and archaeal sites in a setting specifically designed to imitate the situations arising in comparative genomics studies.
https://doi.org/10.1142/9789812799623_0023
Sequence conservation during evolution is the foundation for the functional classification of the enormous number of new protein sequences being discovered in the current era of genome sequencing. Conventional methods to detect homologous proteins are not always able to distinguish between true homologs and false positive hits in the twilight zone of sequence similarity. Several different approaches have been proposed to improve the sensitivity of these methods. Among the most successful are sequence profiles, multi-linked alignment, and threading. However, evolution might offer up other clues about a protein's ancestry that are sequence independent. Here we report the discovery of two such traces of evolution that could potentially be used to help infer the fold of a protein and hence improve the ability to predict the biochemical function. The first such evolutionary trace is a conservation of fold along the genome, i.e. nearby genes tend to share a fold more often than expected by chance alone—a not unexpected observation, but one which holds true even when no pair of genes being examined share appreciable homology. The second such evolutionary trace is, surprisingly, present in expression data: genes that are correlated in expression are more apt to share a fold than two randomly chosen genes. This result is surprising because correlations in expression have previously only been considered useful for determining biological function (e.g. what pathway a particular gene fits into), yet the observed fold enrichment in the expression data permits us to say something about biochemical function since fold corresponds strongly with biochemical function. Again, the fold enrichment observed in the expression data is apparent even when no pair of genes being examined share appreciable homology.
https://doi.org/10.1142/9789812799623_0024
In this paper, we discuss a multiple genome rearrangement problem: given a collection of genomes represented by permutations, generate the collection from some fixed genome, e.g., the identity permutation, in a minimum number of signed reversals. The problem is NP-hard, so efficient heuristics are important for finding good solutions. We first discuss how to generate two and three genomes from a fixed genome with polynomial-time algorithms for some special cases. Based on these polynomial-time algorithms, we obtain approximation algorithms for generating two and three genomes in the general case. Finally, we apply these approximation algorithms to design a new approximation algorithm for generating more genomes. We also show by experimental examples that the algorithms are efficient.
https://doi.org/10.1142/9789812799623_0025
We introduce a way to achieve high-speed homology search by adding just one off-the-shelf PCI board with a Field Programmable Gate Array (FPGA) to a Pentium-based computer system already in use. An FPGA is a reconfigurable device, and any kind of circuit, such as a pattern-matching engine, can be realized on it in a moment. The performance is almost proportional to the size of the FPGA used in the system, and FPGAs are becoming larger and larger, following Moore's law. The latest, largest FPGAs can easily be obtained in the form of off-the-shelf PCI boards at low cost. Our results are as follows. The performance is comparable with small- to middle-class dedicated hardware systems when we use a board with one of the latest FPGAs, and it can be accelerated further by using more FPGA boards. The time for comparing a query sequence of 2,048 elements with a database sequence of 64 million elements by the Smith-Waterman algorithm is about 34 seconds, which is about 330 times faster than a desktop computer with a 1 GHz Pentium III. We can also accelerate the performance of a laptop computer using a PC card with one smaller FPGA; there, the time for comparing a query sequence (1,024 elements) with the same database sequence (64 million) is about 185 seconds, about 30 times faster than the desktop computer.
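For reference, the computation the FPGA parallelizes is the standard Smith-Waterman local-alignment recurrence. A plain-software sketch (linear gap penalty for brevity; the hardware's exact scoring scheme is not specified in the abstract):

```python
def smith_waterman(query, db, match=2, mismatch=-1, gap=-2):
    """Best local-alignment score between query and db."""
    m, n = len(query), len(db)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if query[i - 1] == db[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # substitution
                          H[i - 1][j] + gap,    # gap in database sequence
                          H[i][j - 1] + gap)    # gap in query
            best = max(best, H[i][j])
    return best
```

Each cell depends only on its upper, left, and upper-left neighbors, so all cells along an anti-diagonal are independent; this is exactly the parallelism a systolic FPGA design exploits.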
https://doi.org/10.1142/9789812799623_0026
The recognition of complex carbohydrates and glycoconjugates as mediators of important biological processes has stimulated investigations into the underlying principles. Unfortunately, the rate at which new information has been generated has been slow during the last decade. Carbohydrates differ from the two other classes of biological macromolecules (proteins and DNA/RNA) in two important characteristics: their residues can be connected by many different linkage types, and they can form highly branched molecules. As a consequence, carbohydrate chains have a potential information content that, for a short sequence, is several orders of magnitude higher than that of any other oligomer formed by nucleotides or amino acids. This structural variance allows oligosaccharides to encode information for specific molecular recognition and to serve as determinants of protein folding and stability. However, structural complexity is also one of the major barriers to rapid progress in the field of glycobiology, because carbohydrates are laborious to analyze and extremely difficult to synthesize. Given that glycosylation of proteins and lipids is the most ubiquitous form of posttranslational modification, the unexpectedly small number of genes identified in the initial analysis of the human genome sequence provides even more motivation for understanding the biological roles of oligosaccharides…
https://doi.org/10.1142/9789812799623_0027
Inspection of protein databases suggests that as many as 70% of proteins have potential N-glycosylation sites. Unfortunately, glycoproteins often refuse to crystallize, and NMR techniques do not allow an unambiguous determination of the complete conformation of the sugar part. Therefore, time-consuming, complex simulation methods are often used to explore the conformational space of N-glycans. The generation of a comprehensive database describing the conformational space of larger fragments of N-glycans, taking into account the effects of branching, is presented. High-temperature molecular dynamics simulations of essential N-glycan fragments are performed until conformational equilibrium has been reached. Free energy landscapes are calculated for each glycosidic linkage. All possible conformations for each N-glycan fragment are automatically assigned, ranked according to their relative population, and stored in a database. These values are recalled to generate a complete set of all possible conformations for a given N-glycan topology, and the constructed conformations are ranked according to their energy content. Since this approach makes it possible to explore the complete conformational space of a given N-glycan within a few minutes of CPU time on a standard PC, it is well suited for use as a web-based application.
https://doi.org/10.1142/9789812799623_0028
GlycoSuiteDB, a database of glycan structures, has been constructed with an emphasis on quality, consistency and data integrity. Importance has been placed on making the database a reliable and useful resource for all researchers. This database can help researchers to identify what glycan structures are known to be attached to certain glycoproteins, as well as more generally identifying what types of glycan structures are associated with different states, for example, different species, tissues and diseases. To achieve this, a major effort has gone into data standardisation. Many rules and standards have been adopted, especially for representing glycan structure and biological source information. This paper describes some of the challenges faced during the continuous development of GlycoSuiteDB.
https://doi.org/10.1142/9789812799623_0029
https://doi.org/10.1142/9789812799623_0030
Even though the number and the size of sequence databases are growing rapidly, most new information relevant to biology research is still recorded as free text in journal articles and in comment fields of databases like the GenBank feature table annotations. As biomedical research enters the postgenome era, new kinds of databases that contain information beyond simple sequences are needed, for example, information on cellular localization, protein-protein interactions, gene regulation and the context of these interactions. The forerunners of such databases include KEGG [1], DIP [2], and BIND [3], among others. Such databases are still small in size and are largely hand curated. A factor that can accelerate their growth is the development of reliable literature data mining technologies…
https://doi.org/10.1142/9789812799623_0031
A growing body of work addresses automated mining of biochemical knowledge from digital repositories of scientific literature, such as MEDLINE. Some of these works use abstracts as the unit of text from which to extract facts. Others use sentences for this purpose, while still others use phrases. Here we compare abstracts, sentences, and phrases in MEDLINE using the standard information retrieval performance measures of recall, precision, and effectiveness, for the task of mining interactions among biochemical terms based on term co-occurrence. Results show statistically significant differences that can impact the choice of text unit.
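A hedged sketch of the comparison: extract term pairs that co-occur within a text unit at the chosen granularity (abstract, sentence, or phrase) and score them against a gold-standard set. "Effectiveness" is rendered here as the F-measure, a common IR summary that may differ from the paper's exact definition.

```python
from itertools import combinations

def cooccurring_pairs(units, terms):
    """units: text strings at one granularity; terms: biochemical term list."""
    pairs = set()
    for unit in units:
        present = sorted(t for t in terms if t in unit.lower())
        pairs.update(combinations(present, 2))
    return pairs

def evaluate(predicted, gold):
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

Smaller units generally trade recall for precision, which is the kind of difference the study quantifies.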
https://doi.org/10.1142/9789812799623_0032
MEDSYNDIKATE is a natural language processor for automatically acquiring knowledge from medical finding reports. The content of these documents is transferred to formal representation structures that constitute a corresponding text knowledge base. The system architecture integrates requirements from the analysis of single sentences as well as those of referentially linked sentences forming cohesive texts. The strong demands MEDSYNDIKATE places on the availability of expressive knowledge sources are accounted for by two alternative approaches to (semi)automatic ontology engineering. We also present data on the knowledge extraction performance of MEDSYNDIKATE for three major syntactic patterns in medical documents.
https://doi.org/10.1142/9789812799623_0033
Due to the recent explosion of information in the biomedical field, it is hard for a single researcher to review the complex network involving genes, proteins, and interactions. We are currently building GeneScene, a toolkit that will assist researchers in reviewing existing literature, and report on the first phase in our development effort: extracting the relevant information from medical abstracts. We are developing a medical parser that extracts information, fills basic prepositional-based templates, and combines the templates to capture the underlying sentence logic. We tested our parser on 50 unseen abstracts and found that it extracted 246 templates with a precision of 70%. In comparison with many other techniques, more information was extracted without sacrificing precision. Future improvement in precision will be achieved by correcting three categories of errors.
https://doi.org/10.1142/9789812799623_0034
We describe the design of a robust parser for identifying and extracting biomolecular relations from the biomedical literature. Separate automata over distinct syntactic domains were developed for the extraction of nominal-based relational information versus verbal-based relations. This allowed us to optimize the grammars separately for each module, regardless of any specific relation, resulting in significantly better performance. A unique feature of this system is the use of text-based anaphora resolution to enhance the results of argument binding in relational extraction. We demonstrate the performance of our system on inhibition relations, and present our initial results measured against an annotated text used as a gold standard for evaluation purposes. The results represent a significant improvement over previously published results on extracting such relations from MEDLINE: precision was 90%, recall 57%, and partial recall 22%. These results demonstrate the effectiveness of a corpus-based linguistic approach to information extraction over MEDLINE.
https://doi.org/10.1142/9789812799623_0035
We present an automatic method to classify the sub-cellular location of proteins based on the text of relevant MEDLINE abstracts. For each protein, a vector of terms is generated from the MEDLINE abstracts in which the protein/gene's name or a synonym occurs. A Support Vector Machine (SVM) is used to automatically partition the term space and thus discriminate the textual features that define sub-cellular location. The method is benchmarked on a set of proteins of known sub-cellular location from S. cerevisiae. Neither prior knowledge of the problem domain nor any natural language processing is used at any stage. The method outperforms support vector machines trained on amino acid composition and has comparable performance to rule-based text classifiers. Combining text with protein amino acid composition improves recall for some sub-cellular locations. We discuss the generality of the method and its potential application to a variety of biological classification problems.
https://doi.org/10.1142/9789812799623_0036
Faced with the need for human comprehension of any large collection of objects, a time-honored approach has been to cluster the objects into groups of closely related objects. Individual groups are then summarized in some convenient manner to provide a more manageable view of the data. Such methods have been applied to document collections with mixed results. If a hard clustering of the data into mutually exclusive clusters is performed, documents are frequently forced into one cluster when they contain important information that would also make them appropriate candidates for other clusters. If a soft clustering is used, there remains the problem of how to provide a useful summary of the data in a cluster. Here we introduce a new algorithm to produce a soft clustering of document collections based on the concept of a theme. A theme is conceptually a subject area that is discussed by multiple documents in the database. A theme has two potential representations that may be viewed as dual to each other: first, the set of documents that discuss the subject or theme; and second, the set of key terms that are typically used to discuss the theme. Our algorithm is an EM algorithm in which the term representation and the document representation are explicit components, and each is used to refine the other in an alternating fashion. Upon convergence, the term representation provides a natural summary of the document representation (the cluster). We describe how to optimize the themes produced by this process and give the results of applying the method to a database of over fifty thousand PubMed documents dealing with the subject of AIDS. How themes may improve access to a document collection is also discussed.
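The alternating refinement can be sketched for a single theme. The normalization and initialization below are illustrative assumptions, not the paper's exact update rules.

```python
import numpy as np

def refine_theme(X, seed_term_weights, n_iter=20):
    """X: docs x terms count matrix; seed_term_weights: initial term vector."""
    t = seed_term_weights / np.linalg.norm(seed_term_weights)
    d = None
    for _ in range(n_iter):
        d = X @ t                      # score documents by the theme's terms
        d = d / (np.linalg.norm(d) or 1.0)
        t = X.T @ d                    # re-weight terms by the theme's documents
        t = t / (np.linalg.norm(t) or 1.0)
    return d, t  # dual representations: document memberships and term summary
```

At convergence the term vector `t` is precisely the summary described above: the key terms that characterize the documents most strongly associated with the theme.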
https://doi.org/10.1142/9789812799623_0037
The completion of major metazoan genomes, such as the human genome, has propelled life science research into a new era. Now that the "Book of Life", as some have called it, has been sequenced, and the vocabulary (genes) is being catalogued, the task now at hand is to identify the syntax and semantics of the book, and make sense out of what currently looks to us more like Lewis Carroll's "Jabberwocky" than Shakespeare's "Hamlet". The research and development for the next generation of bioinformatics tools for this task is on our critical path to unlocking the secrets of the human genome…
https://doi.org/10.1142/9789812799623_0038
The genomic sequencing of hundreds of organisms, including Homo sapiens, and the exponential growth in gene expression and proteomic data for many species have revolutionized research in biology. However, the computational analysis of these burgeoning datasets has been hampered by the sparse successes in combinations of data sources, representations, and algorithms. Here we propose the application of symbolic toolsets from the formal methods community to problems of biological interest, particularly signaling pathways, and more specifically mammalian mitogenic and stress responsive pathways. The results of formal symbolic analysis with extremely efficient representations of biological networks provide insights with potential biological impact. In particular, novel hypotheses may be generated which could lead to wet lab validation of new signaling possibilities. We demonstrate the graphic representation of the results of formal analysis of pathways, including navigational abilities, and describe the logical underpinnings of the approach. In summary, we propose and provide an initial description of an algebra and logic of signaling pathways and biologically plausible abstractions that provide the foundation for the application of high-powered tools such as model checkers to problems of biological interest.
https://doi.org/10.1142/9789812799623_0039
We present a statistical method for the prediction of protein–protein interactions within an organism. This approach is based on the treatment of proteins as collections of conserved domains, where each domain is responsible for a specific interaction with another domain. By characterizing the frequency with which specific domain–domain interactions occur within known interactions, our model can assign a probability to an arbitrary interaction between any two proteins with defined domains. Domain interaction data is complemented with information on the topology of a network and is incorporated into the model by assigning greater probabilities to networks displaying more biologically realistic topologies. We use Markov chain Monte Carlo techniques for the prediction of posterior probabilities of interaction between a set of proteins, allowing application of the method to large data sets. In this work we attempt to predict interactions in a set of 40 human proteins, known to form a connected network, and discuss methods for future improvement.
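Under the common independence assumption, two proteins fail to interact only if every pairing of their constituent domains fails to interact. A minimal sketch of that probability calculation (the network-topology prior and the MCMC machinery are not shown):

```python
def interaction_probability(domains_a, domains_b, p_dd):
    """p_dd[(d1, d2)]: estimated probability that domains d1 and d2 interact."""
    p_no_interaction = 1.0
    for da in domains_a:
        for db in domains_b:
            p = p_dd.get((da, db), p_dd.get((db, da), 0.0))
            p_no_interaction *= 1.0 - p
    return 1.0 - p_no_interaction
```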
https://doi.org/10.1142/9789812799623_0040
We report the identification of several putative muscle-specific regulatory elements, and genes which are expressed preferentially in the muscle of the nematode Caenorhabditis elegans. We used computational pattern finding methods to identify cis-regulatory motifs from promoter regions of a set of genes known to express preferentially in muscle; each motif describes the potential binding sites for an unknown regulatory factor. The significance and specificity of the identified motifs were evaluated using several different control sequence sets. Using the motifs, we searched the entire C. elegans genome for genes whose promoter regions have a high probability of being bound by the putative regulatory factors. Genes that met this criterion and were not included in our initial set were predicted to be good candidates for muscle expression. Some of these candidates are additional, known muscle-expressed genes, and several others are shown here to be preferentially expressed in muscle cells by using GFP (green fluorescent protein) constructs. The methods described here can be used to predict the spatial expression pattern of many uncharacterized genes.
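One standard way to represent such motifs and scan promoters with them is a position weight matrix; the sketch below is illustrative only and does not reproduce the paper's own pattern-finding or scoring details.

```python
import math

def pwm_scan(seq, pwm, background=0.25):
    """pwm: one dict per motif position mapping base -> frequency (with pseudocounts)."""
    w = len(pwm)
    best = float("-inf")
    for i in range(len(seq) - w + 1):
        score = sum(math.log(pwm[j].get(seq[i + j], 1e-6) / background)
                    for j in range(w))
        best = max(best, score)  # best match anywhere in the promoter
    return best
```

Ranking every promoter in the genome by such scores, then taking top-scoring genes outside the training set, yields candidate muscle-expressed genes of the kind validated here with GFP constructs.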
https://doi.org/10.1142/9789812799623_0041
We develop principled methods for the automatic induction (discovery) of genetic regulatory network models from multiple data sources and data modalities. Models of regulatory networks are represented as Bayesian networks, allowing the models to compactly and robustly capture probabilistic multivariate statistical dependencies between the various cellular factors in these networks. We build on previous Bayesian network validation results by extending the validation framework to the context of model induction, leveraging heuristic simulated annealing search algorithms and posterior model averaging. Using expression data in isolation yields results inconsistent with location data, so we incorporate genomic location data to guide the model induction process. We combine these two data modalities by allowing location data to influence the model prior and expression data to influence the model likelihood. We demonstrate the utility of this approach by discovering genetic regulatory models of thirty-three variables involved in S. cerevisiae pheromone response. The models we automatically generate are consistent with the current understanding regarding this regulatory network, but also suggest new directions for future experimental investigation.
https://doi.org/10.1142/9789812799623_0042
Researchers in computational biology today make use of a large number of different software packages for modeling, analysis, and data manipulation and visualization. In this paper, we describe the ERATO Systems Biology Workbench (SBW), a software framework that allows these heterogeneous application components—written in diverse programming languages and running on different platforms—to communicate and use each others' data and algorithmic capabilities. Our goal is to create a simple, open-source software infrastructure which is effective, easy to implement and easy to understand. SBW uses a broker-based architecture and enables applications (potentially running on separate, distributed computers) to communicate via a simple network protocol. The interfaces to the system are encapsulated in client-side libraries that we provide for different programming languages. We describe the SBW architecture and the current set of modules, as well as alternative implementation technologies.
https://doi.org/10.1142/9789812799623_0043
Visualization of results from high-performance computing poses special problems due to the complexity and volume of the data these systems manipulate. We present an approach for visualization of cDNA microarray gene expression data in metabolic and regulatory pathways using multi-resolution animation at different levels of detail. We describe three scoring functions to characterize pathways at the transcriptional level, based on gene expression, coregulation, and cascade effects. We also assess the significance of each pathway score on the basis of its biological connotation.
https://doi.org/10.1142/9789812799623_0044
We address a commonly asked question about gene expression data sets: "What functional classes of genes are most interesting in the data?" In the methods we present, expression data is partitioned into classes based on existing annotation schemes. Each class is then given three separately derived "interest" scores. The first score is based on an assessment of the statistical significance of gene expression changes experienced by members of the class, in the context of the experimental design. The second is based on the co-expression of genes in the class. The third score is based on the learnability of the classification. We show that all three methods reveal significant classes in each of three different gene expression data sets. Many classes are identified by one method but not the others, indicating that the methods are complementary. The classes identified are in many cases of clear relevance to the experiment. Our results suggest that these class scoring methods are useful tools for exploring gene expression data.
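As an illustration of the co-expression score, a hedged sketch comparing a class's mean pairwise correlation against a permutation null of random same-size gene sets; the paper's exact statistics are not reproduced here.

```python
import numpy as np

def coexpression_score(X, members, n_perm=1000, seed=0):
    """X: genes x conditions expression matrix; members: row indices of one class."""
    rng = np.random.default_rng(seed)

    def mean_corr(idx):
        C = np.corrcoef(X[idx])
        return C[np.triu_indices_from(C, k=1)].mean()

    observed = mean_corr(members)
    null = np.array([mean_corr(rng.choice(len(X), len(members), replace=False))
                     for _ in range(n_perm)])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value
```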
https://doi.org/10.1142/9789812799623_0045
BIOLINGUA is a computational system designed to support biologists' efforts to construct models, make predictions, and interpret data. In this paper, we focus on the specific task of revising an initial model of gene regulation based on expression levels from gene microarrays. We describe BIOLINGUA's formalism for representing process models, its method for predicting qualitative correlations from such models, and its use of data to constrain search through the space of revised models. We also report experimental results on revising a model of photosynthetic regulation in Cyanobacteria to better fit expression data for both wild and mutant strains, along with model mutilation studies designed to test our method's robustness. In closing, we discuss related work on representing, discovering, and revising biological models, after which we propose some directions for future research.
https://doi.org/10.1142/9789812799623_0046
This paper reports the methods and results of a computer-based search for causal relationships in the gene-regulation pathway of galactose metabolism in the yeast Saccharomyces cerevisiae. The search uses recently published data from cDNA microarray experiments. A Bayesian method was applied to learn causal networks from a mixture of observational and experimental gene-expression data. The observational data were gene-expression levels obtained from unmanipulated "wild-type" cells. The experimental data were produced by deleting ("knocking out") genes and observing the expression levels of other genes. Causal relations predicted from the analysis on 36 galactose gene pairs are reported and compared with the known galactose pathway. Additional exploratory analyses are also reported.
https://doi.org/10.1142/9789812799623_0047
Phylogenetics has a long history, with many contributions to the field actually predating Darwin. Genomics, in contrast, is barely a decade old. While these two fields may be disparate in their ages, each has much to contribute to the further development of the other. Phylogenetics provides a rich source of methodologies able to facilitate and enhance genomic research. Genome structure and function analyses can be strengthened through examining evolutionary history, whereby patterns and rates of genomic evolution can be inferred. The analysis of evolutionary history is similarly empowered by a genomic perspective, providing both a wealth of characters ideally suited for phylogenetic analysis as well as interesting subject matter in the form of comparative genomics…
https://doi.org/10.1142/9789812799623_0048
The effects of the genomic revolution are beginning to be felt in all disciplines of the biological sciences. Evolutionary biology in general, and phylogenetic systematics in particular, are being revolutionized by these advances. The advent of rapid nucleotide sequencing techniques has provided phylogenetic biologists with the tools required to quickly and efficiently generate large amounts of character information. We use the family Drosophilidae as a model system for studying phylogenetics and genome evolution by combining high-throughput sequencing methods from the field of genomics with standard phylogenetic methodology. This paper presents preliminary results from this work. Separate data partitions, based on either gene function or linkage group, are compared to a combined analysis of all the data to assess support on phylogenetic trees.
https://doi.org/10.1142/9789812799623_0049
Evolution operates on whole genomes through mutations that change the order and strandedness of genes within the genomes. Thus analyses of gene-order data present new opportunities for discoveries about deep evolutionary events, provided that sufficiently accurate methods can be developed to reconstruct evolutionary trees. In this paper we present two new methods of character coding for parsimony-based analysis of genomic rearrangements: one called MPBE-2, and a new parsimony-based method which we call MPME (based on an encoding of Bryant), both variants of the MPBE method. We then conduct computer simulations to compare this class of methods to distance-based methods (NJ under various distance measures). Our empirical results show that two of our new methods return highly accurate estimates of the true tree, outperforming the other methods significantly, especially when close to saturation.
https://doi.org/10.1142/9789812799623_0050
Ancient gene duplication events have left many traces in vertebrate genomes. Reconciled trees represent the differences between gene family trees and the species phylogeny those genes are sampled from, allowing us to both infer gene duplication events and estimate a species phylogeny from a sample of gene families. We show that analysis of 118 gene families yields a phylogeny of vertebrates largely in agreement with other data. We formulate the problem of locating episodes of gene duplication as a set cover problem: given a species tree in which each node has a set of gene duplications associated with it, the smallest set of species nodes whose union includes all gene duplications specifies the locations of gene duplication episodes. By generating a unique mapping from this cover set we can determine the minimal number of such episodes at each location. When applied to our data, this method reveals a complex history of gene duplications in vertebrate evolution that does not conform to the "2R" hypothesis.
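The set-cover formulation invites a greedy sketch: repeatedly pick the species-tree node whose duplication set covers the most still-uncovered gene duplications. Greedy selection only approximates the minimum cover, and the paper's exact procedure may differ.

```python
def cover_duplications(node_dups):
    """node_dups: species-tree node -> set of gene duplications mapped to it."""
    uncovered = set().union(*node_dups.values())
    episodes = []
    while uncovered:
        node = max(node_dups, key=lambda n: len(node_dups[n] & uncovered))
        episodes.append(node)          # place a duplication episode at this node
        uncovered -= node_dups[node]
    return episodes
```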
https://doi.org/10.1142/9789812799623_0051
Protein structure prediction is one of the most exciting and difficult problems in computational molecular biology. New computational advances, such as that currently underway with IBM's Blue Gene supercomputer (projected completion in 2005), as well as advances in the understanding of energy potentials and development of statistical methods for recognition of specific folds or protein classes, together promise better methods for protein structure prediction. Indeed, we have recently witnessed enormous progress in the field of protein folding, including successful approaches to computational structure prediction as documented in recent CASP competitions…
https://doi.org/10.1142/9789812799623_0052
A new method for considering solvation when calculating electrostatics for protein docking is proposed. Solvent-exposed charges are attenuated by induced solvent polarization charges. The modified charges are pre-calculated, so the correction does not affect the speed of the actual simulation. The new Screened Charge Electrostatic Model (SChEM) results in improved discrimination of near-native solutions from false positives in docking simulations compared to conventional 'non-solvated' charge assignment. A series of protein-protein complexes were analyzed by running automated rigid-body Monte Carlo docking simulations using the 3-D coordinates of the unbound components. In all but one case, the use of solvation-screened charges for electrostatic calculations improved the rank of the near-native solution after rigid-body simulations. SChEM also drastically improved the results of the subsequent refinement of the interface side-chains. In all cases the final lowest-energy solution was found within 3.0 Å r.m.s.d. of the crystal structure.
https://doi.org/10.1142/9789812799623_0053
We introduce a new sequence-similarity kernel, the spectrum kernel, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. Our kernel is conceptually simple and efficient to compute and, in experiments on the SCOP database, performs well in comparison with state-of-the-art methods for homology detection. Moreover, our method produces an SVM classifier that allows linear time classification of test sequences. Our experiments provide evidence that string-based kernels, in conjunction with SVMs, could offer a viable and computationally efficient alternative to other methods of protein classification and homology detection.
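The k-spectrum kernel itself is simple to state: the inner product of k-mer count vectors. A minimal sketch using a dictionary of counts (the paper achieves its efficiency with a more sophisticated construction):

```python
from collections import Counter

def spectrum(seq, k):
    """Counts of every length-k substring (k-mer) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k=3):
    """Inner product of the k-mer count vectors of sequences x and y."""
    sx, sy = spectrum(x, k), spectrum(y, k)
    return sum(count * sy[kmer] for kmer, count in sx.items())
```

Because the feature map is explicit and sparse, a trained SVM reduces to a weighted k-mer lookup, which is what makes the linear-time classification of test sequences mentioned above possible.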
https://doi.org/10.1142/9789812799623_0054
Identifying positively selected amino acid sites is an important approach for making inferences about the function of proteins; an amino acid site that is undergoing positive selection is likely to play a key role in the function of the protein. We present a new Bayesian method for identifying positively selected amino acid sites and apply the method to a data set of hemagglutinin sequences from the influenza virus. We show that the results of the new method are in accordance with results obtained using previous methods. More importantly, we also demonstrate how the method can be used for making further inferences about the evolutionary history of the sequences. For example, we demonstrate that sites that are positively selected tend to have a preponderance of conservative amino acid substitutions.
https://doi.org/10.1142/9789812799623_0055
Here we analyze sequence alignments for intrinsically disordered proteins. For 55 disordered protein families we measure the performance of different scoring matrices and propose one adjusted to disordered regions. An iterative algorithm of realigning sequences and recalculating matrices is designed and tested. For each matrix we also test a wide range of gap penalties. Results show an improvement in the ability to detect and discriminate related disordered proteins whose average sequence identity with the other family members is below 50%.
https://doi.org/10.1142/9789812799623_0056
Our previous methodology for ab initio prediction of protein structure is extended here to treat multiple-chain proteins. This involved modification of our united-residue (UNRES) force field and our Conformational Space Annealing (CSA) global optimization procedure. Good results have been obtained for both a four- and a three-helix protein from the CASP3 exercise.
https://doi.org/10.1142/9789812799623_0057
We present a polynomial-time algorithm for estimating optimal HP sequences that fold to a specified target protein conformation, based on Sun et al.'s Grand Canonical (GC) model. Application of the algorithm to related proteins taken from the PDB allows us to explore the nature of the protein genotype–phenotype map. Results suggest (1) that the GC model captures important biological aspects of the mapping between protein sequences and their corresponding structures, and (2) that the set of sequences that map to a target structure with optimal energy is affected by minor differences in structure.
https://doi.org/10.1142/9789812799623_0058
A novel method to analyze evolutionary change is presented and its application to the analysis of sequence data is discussed. The investigated method uses phylogenetic trees of related proteins with an evolutionary model in order to gain insight about protein structure and function. The evolutionary model, based on amino acid substitutions, contains adjustable parameters related to amino acid and sequence properties. A maximum likelihood approach is used with a phylogenetic tree to optimize these parameters. The model is applied to a set of Muscarinic receptors, members of the G-protein coupled receptor family. Here we show that the optimized parameters of the model are able to highlight the general structural features of these receptors.
https://doi.org/10.1142/9789812799623_0059
The Structural Genomics Initiative promises to deliver between 10,000 and 20,000 new protein structures within the next ten years. One challenge will be to predict the functions of these proteins from their structures. Since the newly solved structures will be enriched in proteins with little sequence identity to those whose structures are known, new methods for predicting function will be required. Here we describe the unique structural characteristics of O-glycosidases, enzymes that hydrolyze O-glycosidic bonds between carbohydrates. O-glycosidase function has evolved independently many times, and enzymes that carry out this function are represented by a large number of different folds. We show that O-glycosidases nonetheless have characteristic structural features that cross sequence and fold families. The electrostatic surfaces of this class of enzymes are particularly distinctive. We also demonstrate that accurate prediction of O-glycosidase function from structure is possible.
https://doi.org/10.1142/9789812799623_0060
A new class of kernels for strings is introduced. These kernels can be used by any kernel-based data analysis method, including support vector machines (SVMs). They are derived from probabilistic models to integrate biologically relevant information. We show how to compute the kernels corresponding to several classical probabilistic models, and illustrate their use by building an SVM for the problem of predicting the cleavage site of signal peptides from the amino acid sequence of a protein. At a given false-positive rate, this method retrieves up to 47% more true positives than the classical weight matrix method.
https://doi.org/10.1142/9789812799623_0061
We present an algorithm for exact protein structure prediction in the FCC-HP model, a lattice protein model on the face-centered-cubic lattice that captures the main force of protein folding, namely the hydrophobic force. Structure prediction in this model can be based on the construction of hydrophobic cores. The main focus of the paper is an algorithm for constructing maximally and submaximally compact hydrophobic cores of a given size. This algorithm treats core construction as a constraint satisfaction problem (CSP), and the paper describes its constraint model. The algorithm employs symmetry-excluding constraint-based search [6] and relies heavily on good upper bounds on the number of contacts; here, we use and strengthen upper bounds presented earlier [8]. The resulting structure prediction algorithm (including previous work [7,8]) handles sequences of sizes in the range of real proteins quickly, i.e., we often predict a first structure within a few minutes. The algorithm is the first exact one for the FCC, besides full enumeration, which is impracticable for chain lengths greater than about 15. We tested the algorithm successfully up to sequence length 160, which is far beyond the capabilities even of previous heuristic approaches.
https://doi.org/10.1142/9789812799623_0062
A new hydrophobic score is presented in this paper for detecting native-like folds in a large set of decoy structures. A recent paper (B. D. Silverman, PNAS 98, 4996, 2001) revealed that globular proteins exhibit a relatively universal hydrophobic ratio of 0.74, defined as the ratio of the radii from the protein centroid at which the second-order hydrophobic moment and the zeroth-order moment vanish. This paper further defines a new hydrophobic score, which is used to examine protein decoys, in particular the Holm & Sander, Park & Levitt, and Baker decoy sets. It is shown that the hydrophobic score and profile shapes provide useful information complementary to that provided by other procedures, such as free energy calculations.