The Pacific Symposium on Biocomputing (PSB) 2007 is an international, multidisciplinary conference for the presentation and discussion of current research in the theory and application of computational methods in problems of biological significance. Presentations are rigorously peer reviewed and are published in an archival proceedings volume. PSB 2007 will be held January 3–7, 2007 at the Grand Wailea, Maui. Tutorials will be offered prior to the start of the conference.
PSB 2007 will bring together top researchers from the US, the Asian Pacific nations, and around the world to exchange research results and address open issues in all aspects of computational biology. It is a forum for the presentation of work in databases, algorithms, interfaces, visualization, modeling, and other computational methods, as applied to biological problems, with emphasis on applications in data-rich areas of molecular biology.
The PSB has been designed to be responsive to the need for critical mass in sub-disciplines within biocomputing. For that reason, it is the only meeting whose sessions are defined dynamically each year in response to specific proposals. PSB sessions are organized by leaders of research in biocomputing's “hot topics.” In this way, the meeting provides an early forum for serious examination of emerging methods and approaches in this rapidly changing field.
https://doi.org/10.1142/9789812772435_fmatter
Preface
Contents
https://doi.org/10.1142/9789812772435_0001
No abstract received.
https://doi.org/10.1142/9789812772435_0002
It is widely believed that comparing discrepancies in the protein-protein interaction (PPI) networks of individuals will become an important tool in understanding and preventing diseases. Currently PPI networks for individuals are not available, but gene expression data is becoming easier to obtain and allows us to represent individuals by a co-integrated gene expression/protein interaction network. Two major problems hamper the application of graph kernels – state-of-the-art methods for whole-graph comparison – to compare PPI networks. First, these methods do not scale to graphs of the size of a PPI network. Second, missing edges in these interaction networks are biologically relevant for detecting discrepancies, yet, these methods do not take this into account. In this article we present graph kernels for biological network comparison that are fast to compute and take into account missing interactions. We evaluate their practical performance on two datasets of co-integrated gene expression/PPI networks.
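As a rough illustration of the idea (not the authors' kernels), the following Python sketch compares two networks over a shared node set and lets shared absent edges contribute to the similarity score; the adjacency-matrix representation and the weighting are assumptions made for the example.

```python
import numpy as np

def edge_agreement_kernel(A, B, alpha=0.5):
    """Toy whole-graph similarity for two PPI networks on the same node set.

    Counts node pairs on which the two adjacency matrices agree, rewarding
    shared present edges and (down-weighted by alpha) shared absent edges,
    so that missing interactions also contribute. Illustration only.
    """
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)          # each unordered pair once
    a, b = A[iu], B[iu]
    shared_present = np.sum((a == 1) & (b == 1))
    shared_absent = np.sum((a == 0) & (b == 0))
    return shared_present + alpha * shared_absent

# Toy usage: two 4-node networks differing in one edge.
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]])
B = np.array([[0,1,0,0],[1,0,1,1],[0,1,0,1],[0,1,1,0]])
print(edge_agreement_kernel(A, B))
```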
https://doi.org/10.1142/9789812772435_0003
We introduce Chalkboard, a prototype tool for representing and displaying cell-signaling pathway knowledge, for carrying out simple qualitative reasoning over these pathways, and for generating quantitative biosimulation code. The design of Chalkboard has been driven by the need to quickly model and visualize alternative hypotheses about uncertain pathway knowledge. Chalkboard allows biologists to test in silico the implications of various hypotheses. To fulfill this need, Chalkboard includes (1) a rich ontology of pathway entities and interactions, which is ultimately informed by the basic chemistry and physics among molecules, and (2) a form of qualitative reasoning that computes causal chains and feedback loops within the network of entities and reactions. We demonstrate Chalkboard's capabilities in the domain of APP proteolysis, a pathway that plays a key role in the pathogenesis of Alzheimer's disease. In this pathway (as is common), information is incomplete and parts of the pathway are conjectural rather than experimentally verified. With Chalkboard, we can carry out in silico perturbation experiments and explore the consequences of different conjectural connections and relationships in the network. We believe that pathway reasoning capabilities and in silico experiments will become a critical component of the hypothesis generation phase of modern biological research.
https://doi.org/10.1142/9789812772435_0004
Motivation: The promised disease-related discoveries and advances of the post-genome era have yet to be fully realized, with many opportunities for discovery hiding in the millions of biomedical papers published since. Public databases give access to data extracted from the literature by teams of experts, but their coverage is often limited and lags behind recent discoveries. We present a computational method that combines data extracted from the literature with data from curated sources in order to uncover possible gene-disease relationships that are not directly stated or were missed by the initial mining.
Method: An initial set of genes and proteins is obtained from gene-disease relationships extracted from PubMed abstracts using natural language processing. Interactions involving the corresponding proteins are similarly extracted and integrated with interactions from curated databases (such as BIND and DIP), assigning a confidence measure to each interaction depending on its source. The augmented list of genes and gene products is then ranked combining two scores: one that reflects the strength of the relationship with the initial set of genes and incorporates user-defined weights and another that reflects the importance of the gene in maintaining the connectivity of the network. We applied the method to atherosclerosis to assess its effectiveness.
Results: Top-ranked proteins from the method are related to atherosclerosis with accuracy between 0.85 and 1.00 for the top 20 and between 0.64 and 0.80 for the top 90 when duplicates are ignored; 45% of the top 20 and 75% of the top 90 were derived by the method rather than extracted from text. Thus, although the initial gene set and interactions were automatically extracted from text (and subject to the imprecision of automatic extraction), their use for further hypothesis generation is valuable given adequate computational analysis.
https://doi.org/10.1142/9789812772435_0005
Transient and low-affinity protein complexes pose a challenge to existing experimental methods and traditional computational techniques for structural determination. One example of such a disordered complex is that formed by trimers of influenza virus fusion peptide inserted into a host cell membrane. This fusion peptide is responsible for mediating viral infection, and spectroscopic data suggest that the peptide forms loose multimeric associations that are important for viral infectivity. We have developed an ensemble simulation technique that harnesses >1000 molecular dynamics trajectories to build a structural model for the arrangement of fusion peptide trimers. We predict a trimer structure in which the fusion peptides are packed into proximity while maintaining their monomeric structure. Our model helps to explain the effects of several mutations to the fusion peptide that destroy viral infectivity but do not measurably alter peptide monomer structure. This approach also serves as a general model for addressing the challenging problem of higher-order protein organization in cell membranes.
https://doi.org/10.1142/9789812772435_0006
ABC transporter proteins couple the energy of ATP binding and hydrolysis to substrate transport across a membrane. In humans, clinical studies have implicated mutations in 19 of the 48 known ABC transporters in diseases such as cystic fibrosis and adrenoleukodystrophy. Although divergent in sequence space, the overall topology of these proteins, consisting of two transmembrane domains and two ATP-binding cassettes, is likely to be conserved across diverse organisms. We examine known intra-transporter domain interfaces using crystallographic structures of isolated and complexed domains in ABC transporter proteins and find that the nucleotide binding domain interfaces are better conserved than interfaces at the transmembrane domains. We then apply this analysis to identify known disease-associated point and deletion mutants for which disruption of domain-domain interfaces might indicate the mechanism of disease. Finally, we suggest a possible interaction site based on conservation of sequence and disease-association of point mutants.
https://doi.org/10.1142/9789812772435_0007
Identification of ligand-receptor interactions is important for drug design and treatment of diseases. Difficulties in detecting these interactions using high-throughput experimental techniques motivate the development of computational prediction methods. We propose a novel threading algorithm, LTHREADER, which generates accurate local sequence-structure alignments and integrates statistical and energy scores to predict interactions within ligand-receptor families. LTHREADER uses a profile of secondary structure and solvent accessibility predictions with residue contact maps to guide and constrain alignments. Using a decision tree classifier and low-throughput experimental data for training, it combines information inferred from statistical interaction potentials, energy functions, correlated mutations and conserved residue pairs to predict likely interactions. The significance of predicted interactions is evaluated using the scores for randomized binding surfaces within each family. We apply our method to cytokines, which play a central role in the development of many diseases including cancer and inflammatory and autoimmune disorders. We tested our approach on two representatives from different structural classes (all-alpha and all-beta proteins) of cytokines. In comparison with the state-of-the-art threader RAPTOR, LTHREADER generates on average 20% more accurate alignments of interacting residues. Furthermore, in cross-validation tests, LTHREADER correctly predicts experimentally confirmed interactions for a common binding mode within the 4-helical long chain cytokine family with 75% sensitivity and 86% specificity. For the TNF-like family our method achieves 70% sensitivity with 55% specificity. This is a dramatic improvement over existing methods. Moreover, LTHREADER predicts several novel potential ligand-receptor cytokine interactions.
https://doi.org/10.1142/9789812772435_0008
The study of protein-protein interactions is essential to define the molecular networks that contribute to maintaining homeostasis of an organism's body functions. Disruptions in protein interaction networks have been shown to result in diseases in both humans and animals. Monogenic diseases that disrupt biochemical pathways, such as hereditary coagulopathies (e.g., hemophilia), have provided deep insight into the biochemical pathways of acquired coagulopathies of complex diseases. Indeed, a variety of complex liver diseases can lead to decreased synthesis of the same set of coagulation factors as in hemophilia. Similarly, more complex diseases such as different cancers have been shown to result from malfunctions of common protein pathways. In order to discover, in high throughput, the molecular underpinnings of poorly characterized diseases, we present a statistical method to identify shared protein interaction network(s) between diseases. Integrating (i) a protein interaction network with (ii) disease-to-protein relationships derived from mining Gene Ontology annotations and the biomedical literature with natural language understanding (PhenoGO), we identified protein-protein interactions that were associated with pairs of diseases and calculated the statistical significance of the occurrence of interactions in the protein interaction knowledgebase. Significant correlations between diseases and shared protein networks were identified and evaluated in this study, demonstrating the high precision of the approach and correct non-trivial predictions, signifying the potential for discovery. In conclusion, we demonstrate that the associations between diseases are directly correlated to their underlying protein-protein interaction networks, possibly providing insight into the underlying molecular mechanisms of phenotypes and biological processes disrupted in related diseases.
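The significance calculation can be illustrated with a hypergeometric test; the following sketch is a stand-in for the paper's statistic, and all counts are hypothetical.

```python
from scipy.stats import hypergeom

def shared_network_pvalue(N, K, n, k):
    """P(at least k shared interactions) under random overlap:
    N = interactions in the knowledgebase, K = those linked to disease A,
    n = those linked to disease B, k = observed shared interactions.
    A hypergeometric stand-in for the paper's significance calculation."""
    return hypergeom.sf(k - 1, N, K, n)

# Hypothetical counts: 8 shared interactions between two diseases.
print(shared_network_pvalue(N=50000, K=120, n=200, k=8))
```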
https://doi.org/10.1142/9789812772435_0009
Post-genomic advances in bioinformatics have refined drug-design strategies, by focusing on the reduction of serious side-effects through the identification of enzymatic targets. We consider the problem of identifying the enzymes (i.e., drug targets) whose inhibition will stop the production of a given target set of compounds while eliminating a minimal number of non-target compounds. An exhaustive evaluation of all possible enzyme combinations to find the optimal solution subset may become computationally infeasible for very large metabolic networks. We propose a scalable iterative algorithm which computes a sub-optimal solution within reasonable time-bounds. Our algorithm is based on the intuition that we can arrive at a solution close to the optimal one by tracing backward from the target compounds. It evaluates immediate precursors of the target compounds and iteratively moves backwards to identify the enzymes whose inhibition will stop the production of the target compounds while incurring minimum side-effects. We show that our algorithm converges to a sub-optimal solution within a finite number of such iterations. Our experiments on the E. coli metabolic network show that the average accuracy of our method deviates from that of the exhaustive search only by 0.02%. Our iterative algorithm is highly scalable. It can solve the problem for the entire metabolic network of Escherichia coli in less than 10 seconds.
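A heavily simplified sketch of the backward-tracing intuition (ignoring multi-step precursor chains and the paper's iteration), with a hypothetical toy network:

```python
from collections import defaultdict

# Hypothetical toy network: enzyme -> set of compounds it produces.
produces = {
    "E1": {"C1", "C2"},
    "E2": {"C2", "C3"},
    "E3": {"C3"},
    "E4": {"C4"},
}

def backward_target_enzymes(produces, targets):
    """One backward step (not the paper's full algorithm): to stop each
    target compound, inhibit every enzyme that produces it, then report
    the side effects (non-target compounds whose production is also lost)."""
    producers = defaultdict(set)
    for enz, prods in produces.items():
        for c in prods:
            producers[c].add(enz)
    chosen = set()
    for t in targets:
        chosen |= producers[t]          # all producers must be inhibited
    side_effects = set().union(*(produces[e] for e in chosen)) - set(targets)
    return chosen, side_effects

enzymes, damage = backward_target_enzymes(produces, {"C3"})
print(enzymes, damage)   # {'E2', 'E3'} with side effect {'C2'}
```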
https://doi.org/10.1142/9789812772435_0010
Smallpox is a deadly disease that can be intentionally reintroduced into the human population as a bioweapon. While host gene expression microarray profiling can be used to detect infection, the analysis of this information using unsupervised and supervised classification techniques can produce contradictory results. Here, we present a novel computational approach to incorporate molecular genome annotation features that are key for identifying early infection biomarkers (EIB). Our analysis identified 58 EIBs expressed in peripheral blood mononuclear cells (PBMCs) collected from 21 cynomolgus macaques (Macaca fascicularis) infected with two variola strains via aerosol and intravenous exposure. The level of expression of these EIBs was correlated with disease progression and severity. No overlap was found between the EIB co-expression data and the protein interaction data reported in public databases. This suggests that a pathogen-specific reorganization of the gene expression and protein interaction networks occurs during infection. To identify potential genome-wide protein interactions between variola and humans, we performed a protein domain analysis of all smallpox and human proteins. We found that only 55 of the 161 protein domains in smallpox are also present in the human genome. These co-occurring domains are mostly represented in proteins involved in blood coagulation, complement activation, angiogenesis, inflammation, and hormone transport. Several of these proteins are within the EIBs category and suggest potential new targets for the development of therapeutic countermeasures.
https://doi.org/10.1142/9789812772435_0011
No abstract received.
https://doi.org/10.1142/9789812772435_0012
A significant challenge in metabolomics experiments is extracting biologically meaningful data from complex spectral information. In this paper we compare two techniques for representing 1D NMR spectra: "Spectral Binning" and "Targeted Profiling". We use simulated 1D NMR spectra with specific characteristics to assess the quality of predictive multivariate statistical models built using both data representations. We also assess the effect of different variable scaling techniques on the two data representations. We demonstrate that models built using Targeted Profiling are not only more interpretable than Spectral Binning models, but are also more robust with respect to compound overlap and variability in solution conditions (such as pH and ionic strength). Our findings from the synthetic dataset were validated using a real-world dataset.
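For concreteness, a minimal sketch of the spectral-binning representation on a simulated spectrum; the bin width and peak shapes are arbitrary choices for the example.

```python
import numpy as np

def spectral_binning(ppm, intensity, bin_width=0.04, ppm_min=0.0, ppm_max=10.0):
    """Divide the chemical-shift axis into fixed-width bins and sum the
    intensity falling into each bin: the 'Spectral Binning' representation."""
    edges = np.arange(ppm_min, ppm_max + bin_width, bin_width)
    idx = np.digitize(ppm, edges) - 1
    bins = np.zeros(len(edges) - 1)
    valid = (idx >= 0) & (idx < len(bins))
    np.add.at(bins, idx[valid], intensity[valid])
    return edges, bins

# Simulated spectrum: two Lorentzian peaks at 1.33 and 3.05 ppm.
ppm = np.linspace(0, 10, 5000)
intensity = 1/(1 + ((ppm - 1.33)/0.01)**2) + 0.5/(1 + ((ppm - 3.05)/0.01)**2)
edges, binned = spectral_binning(ppm, intensity)
print(binned.argmax(), edges[binned.argmax()])   # bin holding the largest peak
```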
https://doi.org/10.1142/9789812772435_0013
The term metabolic profiling is often used to denote the systematic characterization of the unique biochemical trails or fingerprints left behind by cellular processes. Advances in computational biosciences are often invaluable in dealing with the huge amount of raw data generated from the countless biochemical intermediates that flood the cell at any given time. As a prelude to metabolic profiling, it is essential to completely profile and compile all related information about the genetic and proteomic data. Profiling tools in bioinformatics refer to all those software (web based and downloadable) that compile all related information in single user-interfaces. Generally, these interfaces take a query such as a DNA, RNA, or protein sequence or keyword; and search one or more databases for information related to that sequence. Summaries and aggregate results are provided in a single standardized format that would otherwise have required visits to many smaller sites or direct literature searches to compile. In other words they are software portals or gateways that simplify the process of finding information about a query in the large and growing number of bioinformatics databases.
https://doi.org/10.1142/9789812772435_0014
Several QSAR models have been developed using a linear optimization approach that enabled distinguishing metabolic substances isolated from human, bacterial, plant, and fungal cells. Seven binary classifiers based on a k-Nearest Neighbors method have been created using a variety of 'inductive' and traditional QSAR descriptors, allowing up to 95% accurate recognition of the studied groups of chemical substances.
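A minimal sketch of one such binary k-NN classifier, with a synthetic descriptor matrix standing in for the real 'inductive' and traditional QSAR descriptors:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical descriptors: rows = compounds, columns = QSAR features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy label, e.g. "human metabolite"

# One of the seven binary classifiers would look like this.
clf = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(clf, X, y, cv=5).mean())  # cross-validated accuracy
```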
The comparative QSAR analysis based on the above-mentioned linear optimization approach helped to identify the extent of overlap between the groups of compounds, such as cross-recognition of fungal and bacterial metabolites and an association between fungal and plant substances. Human metabolites exhibited very different QSAR behavior in chemical space and demonstrated no significant overlap with bacterial-, fungal-, and plant-derived molecules.
When the developed QSAR models were applied to collections of conventional human therapeutics and antimicrobials, the first group of substances demonstrated the strongest association with human metabolites, while the second group exhibited a tendency toward 'bacterial metabolite-like' behavior. We speculate that the established 'drugs – human metabolites' and 'antimicrobials – bacterial metabolites' associations result from strict bioavailability requirements imposed on conventional therapeutic substances, which further supports their metabolite-like properties.
It is anticipated that the study may bring additional insight into QSAR determinants for human, bacterial, fungal, and plant metabolites and may help rationalize the design and discovery of novel bioactive substances with improved, metabolite-like properties.
https://doi.org/10.1142/9789812772435_0015
One of the growing challenges in life science research lies in finding useful, descriptive or quantitative data about newly reported biomolecules (genes, proteins, metabolites and drugs). An even greater challenge is finding information that connects these genes, proteins, drugs or metabolites to each other. Much of this information is scattered through hundreds of different databases, abstracts or books and almost none of it is particularly well integrated. While some efforts are being undertaken at the NCBI and EBI to integrate many different databases together, this still falls short of the goal of having some kind of human-readable synopsis that summarizes the state of knowledge about a given biomolecule – especially small molecules. To address this shortfall, we have developed BioSpider. BioSpider is essentially an automated report generator designed specifically to tabulate and summarize data on biomolecules – both large and small. Specifically, BioSpider allows users to type in almost any kind of biological or chemical identifier (protein/gene name, sequence, accession number, chemical name, brand name, SMILES string, InChI string, CAS number, etc.) and it returns an in-depth synoptic report (~3-30 pages in length) about that biomolecule and any other biomolecule it may target. This summary includes physico-chemical parameters, images, models, data files, descriptions and predictions concerning the query molecule. BioSpider uses a web-crawler to scan through dozens of public databases and employs a variety of specially developed text mining tools and locally developed prediction tools to find, extract and assemble data for its reports. Because of its breadth, depth and comprehensiveness, we believe BioSpider will prove to be a particularly valuable tool for researchers in metabolomics. BioSpider is available at: http://www.biospider.ca.
https://doi.org/10.1142/9789812772435_0016
We recently developed two databases and a laboratory information system as resources for the metabolomics community. These tools are freely available and are intended to ease data analysis in both MS and NMR based metabolomics studies. The first database is a metabolomics extension to the BioMagResBank (BMRB, http://www.bmrb.wisc.edu), which currently contains experimental spectral data on over 270 pure compounds. Each small molecule entry consists of five or six one- and two-dimensional NMR data sets, along with information about the source of the compound, solution conditions, data collection protocol and the NMR pulse sequences. Users have free access to peak lists, spectra, and original time-domain data. The BMRB database can be queried by name, monoisotopic mass and chemical shift. We are currently developing a deposition tool that will enable people in the community to add their own data to this resource. Our second database, the Madison Metabolomics Consortium Database (MMCD, available from http://mmcd.nmrfam.wisc.edu/), is a hub for information on over 10,000 metabolites. These data were collected from a variety of sites with an emphasis on metabolites found in Arabidopsis. The MMC database supports extensive search functions and allows users to make bulk queries using experimental MS and/or NMR data. In addition to these databases, we have developed a new module for the Sesame laboratory information management system (http://www.sesame.wisc.edu) that captures all of the experimental protocols, background information, and experimental data associated with metabolomics samples. Sesame was designed to help coordinate research efforts in laboratories with high sample throughput and multiple investigators and to track all of the actions that have taken place in a particular study.
https://doi.org/10.1142/9789812772435_0017
Metabolomic databases are useless without accurate description of the biological study design and accompanying metadata reporting on the laboratory workflow from sample preparation to data processing. Here we report on the implementation of a database system that enables investigators to detail and set up a biological experiment, and that also steers laboratory workflows by direct access to the data acquisition instrument. SetupX utilizes orthogonal biological parameters such as genotype, organ, and treatment(s) for delineating the dimensions of a study which define the number of classes under investigation. Publicly available taxonomic and ontology repositories are utilized to ensure data integrity and logic consistency of class designs. Class descriptions are subsequently employed to schedule and randomize data acquisitions, and to deploy metabolite annotations carried out by the seamlessly integrated mass spectrometry database, BinBase. Annotated result data files are housed by SetupX for downloads and queries. Currently, 39 users have generated 48 studies, some of which are made public.
https://doi.org/10.1142/9789812772435_0018
Comparative metabolic profiling of cancerous and normal cells improves our understanding of the fundamental mechanisms of tumorigenesis and opens new opportunities in target and drug discovery. Here we report a novel methodology of comparative metabolome analysis integrating the information about both metabolite pools and fluxes associated with a large number of key metabolic pathways in model cancer and normal cell lines. The data were acquired using [U-13C] glucose labeling followed by two-dimensional NMR and GC-MS techniques and analyzed using isotopomer modeling approach. Significant differences revealed between breast cancer and normal human mammary epithelial cell lines are consistent with previously reported phenomena such as upregulation of fatty acid synthesis. Additional changes established for the first time in this study expand a remarkable picture of global metabolic rewiring associated with tumorigenesis and point to new potential diagnostic and therapeutic targets.
https://doi.org/10.1142/9789812772435_0019
With appropriate models, the metabolic profile of a biological system may be interrogated to obtain both significant discriminatory markers as well as mechanistic insight into the observed phenotype. One promising application is the analysis of drug toxicity, where a single chemical triggers multiple responses across cellular metabolism. Here, we describe a modeling framework whereby metabolite measurements are used to investigate the interactions between specialized cell functions through a metabolic reaction network. As a model system, we studied the hepatic transformation of troglitazone (TGZ), an antidiabetic drug withdrawn due to idiosyncratic hepatotoxicity. Results point to a well-defined TGZ transformation module that connects to other major pathways in the hepatocyte via amino acids and their derivatives. The quantitative significance of these connections depended on the nutritional state and the availability of the sulfur containing amino acids.
https://doi.org/10.1142/9789812772435_0020
No abstract received.
https://doi.org/10.1142/9789812772435_0021
We describe a natural language processing system (Enhanced SemRep) to identify core assertions on pharmacogenomics in Medline citations. Extracted information is represented as semantic predications covering a range of relations relevant to this domain. The specific relations addressed by the system provide greater precision than that achievable with methods that rely on entity co-occurrence. The development of Enhanced SemRep is based on the adaptation of an existing system and crucially depends on domain knowledge in the Unified Medical Language System. We provide a preliminary evaluation (55% recall and 73% precision) and discuss the potential of this system in assisting both clinical practice and scientific investigation.
https://doi.org/10.1142/9789812772435_0022
Annotating genes with Gene Ontology (GO) terms is crucial for biologists to characterize the traits of genes in a standardized way. However, manual curation of textual data, the most reliable form of gene annotation by GO terms, requires significant amounts of human effort, is very costly, and cannot catch up with the rate of increase in biomedical publications. In this paper, we present GEANN, a system to automatically infer new GO annotations for genes from biomedical papers based on the evidence support linked to PubMed, a biological literature database of 14 million papers. GEANN (i) extracts from text significant terms and phrases associated with a GO term, (ii) based on the extracted terms, constructs textual extraction patterns with reliability scores for GO terms, (iii) expands the pattern set through "pattern crosswalks", (iv) employs semantic pattern matching, rather than syntactic pattern matching, which allows for the recognition of phrases with close meanings, and (v) annotates genes based on the "quality" of the matched pattern to the genomic entity occurring in the text. On average, in our experiments, GEANN reached a precision of 78% at a recall of 57%.
https://doi.org/10.1142/9789812772435_0023
There has been much work devoted to the mapping, alignment, and linking of ontologies (MALO), but little has been published about how to evaluate systems that do this. A fault model for conducting fine-grained evaluations of MALO systems is proposed, and its application to the system described in Johnson et al. [15] is illustrated. Two judges categorized errors according to the model, and inter-judge agreement was calculated by error category. Overall inter-judge agreement was 98% after dispute resolution, suggesting that the model is consistently applicable. The results of applying the model to the system described in [15] reveal the reason for a puzzling set of results in that paper, and also suggest a number of avenues and techniques for improving the state of the art in MALO, including the development of biomedical domain specific language processing tools, filtering of high frequency matching results, and word sense disambiguation.
https://doi.org/10.1142/9789812772435_0024
Applying Natural Language Processing techniques to biomedical text as a potential aid to curation has become the focus of intensive research. However, developing integrated systems which address the curators' real-world needs has been studied less rigorously. This paper addresses this question and presents generic tools developed to assist FlyBase curators. We discuss how they have been integrated into the curation workflow and present initial evidence about their effectiveness.
https://doi.org/10.1142/9789812772435_0025
There is extensive interest in mining data from full text. We have built a system called SLIF (for Subcellular Location Image Finder), which extracts information on one particular aspect of biology from a combination of text and images in journal articles. Associating the information from the text and image requires matching sub-figures with the sentences in the text. We introduce a stacked graphical model, a meta-learning scheme to augment a base learner by expanding features based on related instances, to match the labels of sub-figures with labels of sentences. The experimental results show a significant improvement in the matching accuracy of the stacked graphical model (81.3%) as compared with a relational dependency network (70.8%) or the current algorithm in SLIF (64.3%).
https://doi.org/10.1142/9789812772435_0026
Like the primary scientific literature, GeneRIFs exhibit both growth and obsolescence. NLM's control over the contents of the Entrez Gene database provides a mechanism for dealing with obsolete data: GeneRIFs are removed from the database when they are found to be of low quality. However, the rapid and extensive growth of Entrez Gene makes manual location of low-quality GeneRIFs problematic. This paper presents a system that takes advantage of the summary-like quality of GeneRIFs to detect low-quality GeneRIFs via a summary revision approach, achieving precision of 89% and recall of 77%. Aspects of the system have been adopted by NLM as a quality assurance mechanism.
https://doi.org/10.1142/9789812772435_0027
We have developed a challenge task for the second BioCreAtIvE (Critical Assessment of Information Extraction in Biology) that requires participating systems to provide lists of the EntrezGene (formerly LocusLink) identifiers for all human genes and proteins mentioned in a MEDLINE abstract. We are distributing 281 annotated abstracts and another 5,000 noisily annotated abstracts along with a gene name lexicon to participants. We have performed a series of baseline experiments to better characterize this dataset and form a foundation for participant exploration.
https://doi.org/10.1142/9789812772435_0028
The number of articles in the MEDLINE database is expected to increase tremendously in the coming years. To ensure that all these documents are indexed with continuing high quality, it is necessary to develop tools and methods that help the indexers in their daily task. We present three methods addressing a novel aspect of automatic indexing of the biomedical literature, namely producing MeSH main heading/subheading pair recommendations. The methods, (dictionary-based, post-processing rules and Natural Language Processing rules) are described and evaluated on a genetics-related corpus. The best overall performance is obtained for the subheading genetics (70% precision and 17% recall with post-processing rules, 48% precision and 37% recall with the dictionary-based method). Future work will address extending this work to all MeSH subheadings and a more thorough study of method combination.
https://doi.org/10.1142/9789812772435_0029
Text analytics is becoming an increasingly important tool used in biomedical research. While advances continue to be made in the core algorithms for entity identification and relation extraction, a need for practical applications of these technologies arises. We developed a system that allows users to explore the US Patent corpus using molecular information. The core of our system contains three main technologies: A high performing chemical annotator which identifies chemical terms and converts them to structures, a similarity search engine based on the emerging IUPAC International Chemical Identifier (InChI) standard, and a set of on demand data mining tools. By leveraging this technology we were able to rapidly identify and index 3,623,248 unique chemical structures from 4,375,036 US Patents and Patent Applications. Using this system a user may go to a web page, draw a molecule, search for related Intellectual Property (IP) and analyze the results. Our results prove that this is a far more effective way of identifying IP than traditional keyword-based approaches.
https://doi.org/10.1142/9789812772435_0030
We propose an approach to predicting implicit gene-disease associations based on the inference network, whereby genes and diseases are represented as nodes and are connected via two types of intermediate nodes: gene functions and phenotypes. To estimate the probabilities involved in the model, two learning schemes are compared; one baseline using co-annotations of keywords and the other taking advantage of free text. Additionally, we explore the use of domain ontologies to complement data sparseness and examine the impact of full text documents. The validity of the proposed framework is demonstrated on the benchmark data set created from real-world data.
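A toy version of the two-layer inference network, with made-up probabilities, shows how gene-disease scores can be accumulated through the intermediate function/phenotype nodes:

```python
import numpy as np

# Toy inference network: genes -> intermediate nodes (functions/phenotypes)
# -> diseases; the edge weights stand in for the estimated probabilities.
P_gene_mid = np.array([[0.8, 0.1],     # gene g1 -> {function f1, phenotype p1}
                       [0.2, 0.7]])    # gene g2
P_mid_dis = np.array([[0.6],           # f1 -> disease d1
                      [0.9]])          # p1 -> disease d1

# Score each gene-disease association by summing over intermediate nodes:
# sum_mid P(mid | gene) * P(disease | mid). A normalization or noisy-OR
# step could replace the plain sum.
scores = P_gene_mid @ P_mid_dis
print(scores)   # g2 ranks higher for d1, via the phenotype route
```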
https://doi.org/10.1142/9789812772435_0031
The Internet is having a profound impact on physicians' medical decision making. One recent survey of 277 physicians showed that 72% of physicians regularly used the Internet to research medical information and 51% admitted that information from web sites influenced their clinical decisions. This paper describes the first cognitive evaluation of four state-of-the-art Internet search engines: Google (i.e., Google and Scholar.Google), MedQA, Onelook, and PubMed for answering definitional questions (i.e., questions with the format of "What is X?") posed by physicians. Onelook is a portal for online definitions, and MedQA is a question answering system that automatically generates short texts to answer specific biomedical questions. Our evaluation criteria include quality of answer, ease of use, time spent, and number of actions taken. Our results show that MedQA outperforms Onelook and PubMed in most of the criteria, and that MedQA surpasses Google in time spent and number of actions, two important efficiency criteria. Our results show that Google is the best system for quality of answer and ease of use. We conclude that Google is an effective search engine for medical definitions, and that MedQA exceeds the other search engines in that it provides users direct answers to their questions; while the users of the other search engines have to visit several sites before finding all of the pertinent information.
https://doi.org/10.1142/9789812772435_0032
No abstract received.
https://doi.org/10.1142/9789812772435_0033
Scientists working on genomics projects are often faced with the difficult task of sifting through large amounts of biological information dispersed across various online data sources that are relevant to their area or organism of research. Gene annotation, the process of identifying the functional role of a possible gene, in particular has become increasingly more time-consuming and laborious to conduct as more genomes are sequenced and the number of candidate genes continues to increase at near-exponential pace; genes are left un-annotated, or worse, incorrectly annotated. Many groups have attempted to address the annotation backlog through automated annotation systems that are geared toward specific organisms, and which may thus not possess the necessary flexibility and scalability to annotate other genomes. In this paper, we present a method and framework which attempts to address problems inherent in manual and automatic annotation by coupling a data integration system, BioMediator, to an inference engine with the aim of elucidating functional annotations. The framework and heuristics developed are not specific to any particular genome. We validated the method with a set of randomly-selected annotated sequences from a variety of organisms. Preliminary results show that the hybrid data integration and inference approach generates functional annotations that are as good as or better than "gold standard" annotations ~80% of the time.
https://doi.org/10.1142/9789812772435_0034
We describe a new publicly available algorithm for identifying absent sequences, and demonstrate its use by listing the smallest oligomers not found in the human genome (human "nullomers"), and those not found in any reported genome or GenBank sequence ("primes"). These absent sequences define the maximum set of potentially lethal oligomers. They also provide a rational basis for choosing artificial DNA sequences for molecular barcodes, show promise for species identification and environmental characterization based on absence, and identify potential targets for therapeutic intervention and suicide markers.
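The core computation can be sketched in a few lines; this naive version enumerates all k-mers and would need a more memory-conscious encoding (e.g., bit arrays) for a whole genome.

```python
from itertools import product

def absent_kmers(sequence, k):
    """Return all k-mers over {A,C,G,T} that never occur in `sequence`.
    The smallest k with a non-empty result gives the 'nullomers'."""
    seen = {sequence[i:i+k] for i in range(len(sequence) - k + 1)}
    return [''.join(p) for p in product('ACGT', repeat=k)
            if ''.join(p) not in seen]

seq = "ACGTACGGTACCGTTAGC"   # toy stand-in for a genome
for k in range(1, 6):
    missing = absent_kmers(seq, k)
    if missing:
        print(k, missing[:5], '...')   # smallest absent oligomers
        break
```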
https://doi.org/10.1142/9789812772435_0035
Herein, we describe our ongoing efforts to develop a robust ontology for amphibian anatomy that accommodates the diversity of anatomical structures present in the group. We discuss the design and implementation of the project, current resolutions to issues we have encountered, and future enhancements to the ontology. We also comment on future efforts to integrate other data sets via this amphibian anatomical ontology.
https://doi.org/10.1142/9789812772435_0036
A common approach for identifying pathways from gene expression data is to cluster the genes without using prior information about a pathway, which often identifies only the dominant coexpression groups. Recommender systems are well-suited for using the known genes of a pathway to identify the appropriate experiments for predicting new members. However, existing systems, such as the GeneRecommender, ignore how genes naturally group together within specific experiments. We present a collaborative filtering approach which uses the pattern of how genes cluster together in different experiments to recommend new genes in a pathway. Clusters are first identified within a single experiment series. Informative clusters, in which the user-supplied query genes appear together, are identified. New genes that cluster with the known genes, in a significant fraction of the informative clusters, are recommended. We implemented a prototype of our system and measured its performance on hundreds of pathways. We find that our method performs as well as an established approach while significantly increasing the speed and scalability of searching large datasets. [Supplemental material is available online at http://sysbio.soe.ucsc.edu/cluegene/psb07.]
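A minimal sketch of the cluster-based recommendation step, with hypothetical clusterings and an arbitrary co-occurrence threshold:

```python
from collections import Counter

# Hypothetical clusterings: one list of gene clusters per experiment series.
clusterings = [
    [{"g1", "g2", "g5"}, {"g3", "g4"}],
    [{"g1", "g2", "g6"}, {"g3", "g5"}],
    [{"g1", "g3"}, {"g2", "g4", "g5"}],
]

def recommend(clusterings, query, min_frac=0.5):
    """Clusters containing >= 2 query genes are 'informative'; genes that
    co-cluster with the query in a large fraction of them are recommended."""
    informative = [c for cl in clusterings for c in cl
                   if len(c & query) >= 2]
    counts = Counter(g for c in informative for g in c - query)
    return [g for g, n in counts.items()
            if n / len(informative) >= min_frac]

print(recommend(clusterings, {"g1", "g2"}))   # ['g5', 'g6'] on this toy data
```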
https://doi.org/10.1142/9789812772435_0037
Today, digitization of legacy literature is a big issue. This also applies to the domain of biosystematics, where this process has just started. Digitized biosystematics literature requires a very precise and fine grained markup in order to be useful for detailed search, data linkage and mining. However, manual markup on sentence level and below is cumbersome and time consuming. In this paper, we present and evaluate the GoldenGATE editor, which is designed for the special needs of marking up OCR output with XML. It is built in order to support the user in this process as far as possible: Its functionality ranges from easy, intuitive tagging through markup conversion to dynamic binding of configurable plug-ins provided by third parties. Our evaluation shows that marking up an OCR document using GoldenGATE is three to four times faster than with an off-the-shelf XML editor like XML-Spy. Using domain-specific NLP-based plug-ins, these numbers are even higher.
https://doi.org/10.1142/9789812772435_0038
High-throughput proteomics is a rapidly developing field that offers the global profiling of proteins from a biological system. These high-throughput technological advances are fueling a revolution in biology, enabling analyses at the scale of entire systems (e.g., whole cells, tumors, or environmental communities). However, simply identifying the proteins in a cell is insufficient for understanding the underlying complexity and operating mechanisms of the overall system. Systems level investigations generating large-scale global data are relying more and more on computational analyses, especially in the field of proteomics.
https://doi.org/10.1142/9789812772435_0039
A major challenge in shotgun proteomics has been the assignment of identified peptides to the proteins from which they originate, referred to as the protein inference problem. Redundant and homologous protein sequences present a challenge in being correctly identified, as a set of peptides may in many cases represent multiple proteins. One simple solution to this problem is the assignment of the smallest number of proteins that explains the identified peptides. However, it is not certain that a natural system should be accurately represented using this minimalist approach. In this paper, we propose a reformulation of the protein inference problem by utilizing the recently introduced concept of peptide detectability. We also propose a heuristic algorithm to solve this problem and evaluate its performance on synthetic and real proteomics data. In comparison to a greedy implementation of the minimum protein set algorithm, our solution that incorporates peptide detectability performs favorably.
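The minimalist baseline the authors compare against is essentially greedy set cover; a sketch follows (their detectability-based reformulation would score candidate proteins differently):

```python
def greedy_minimal_proteins(protein_to_peptides, observed):
    """Greedy set cover: repeatedly pick the protein explaining the most
    still-unexplained peptides. This is the minimalist baseline, not the
    paper's detectability-aware algorithm."""
    uncovered = set(observed)
    chosen = []
    while uncovered:
        best = max(protein_to_peptides,
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        gained = protein_to_peptides[best] & uncovered
        if not gained:
            break                      # remaining peptides unexplainable
        chosen.append(best)
        uncovered -= gained
    return chosen

# Toy data: homologous proteins sharing peptides.
prot = {"P1": {"a", "b", "c"}, "P2": {"b", "c"}, "P3": {"c", "d"}}
print(greedy_minimal_proteins(prot, {"a", "b", "c", "d"}))  # ['P1', 'P3']
```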
https://doi.org/10.1142/9789812772435_0040
The assumption on the mass error distribution of fragment ions plays a crucial role in peptide identification by tandem mass spectra. Previous mass error models are the simplistic uniform or normal distribution with empirically set parameter values. In this paper, we propose a more accurate mass error model, namely conditional normal model, and an iterative parameter learning algorithm. The new model is based on two important observations on the mass error distribution, i.e. the linearity between the mean of mass error and the ion mass, and the log-log linearity between the standard deviation of mass error and the peak intensity. To our knowledge, the latter quantitative relationship has never been reported before. Experimental results demonstrate the effectiveness of our approach in accurately quantifying the mass error distribution and the ability of the new model to improve the accuracy of peptide identification.
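A rough simulation of the two observations and their recovery by linear fits; this is a single pass, whereas the paper learns the parameters iteratively, and all constants below are invented.

```python
import numpy as np

# Simulated fragment matches: ion mass m, peak intensity I, and a mass
# error drawn to mimic the paper's two observations.
rng = np.random.default_rng(1)
m = rng.uniform(100, 2000, 5000)
I = rng.uniform(1e2, 1e6, 5000)
true_mu = 1e-5 * m + 0.002                      # mean linear in ion mass
true_sigma = 10 ** (-0.25 * np.log10(I) - 0.5)  # log-log linear in intensity
err = rng.normal(true_mu, true_sigma)

# Recover the mean model with a linear fit.
a, b = np.polyfit(m, err, 1)
resid = err - (a * m + b)

# Bin by intensity to estimate local std, then fit log(std) vs log(I).
order = np.argsort(I)
bins = np.array_split(order, 50)
logI = np.array([np.log10(I[ix]).mean() for ix in bins])
logS = np.array([np.log10(resid[ix].std()) for ix in bins])
c, d = np.polyfit(logI, logS, 1)
print(a, b, c, d)   # should approximate (1e-5, 0.002, -0.25, -0.5)
```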
https://doi.org/10.1142/9789812772435_0041
Integrating diverse sources of interaction information to create protein networks requires strategies sensitive to differences in accuracy and coverage of each source. Previous integration approaches calculate reliabilities of protein interaction information sources based on congruity to a designated 'gold standard.' In this paper, we provide a comparison of the two most popular existing approaches and propose a novel alternative for assessing reliabilities which does not require a gold standard. We identify a new method for combining the resultant reliabilities and compare it against an existing method. Further, we propose an extrinsic approach to evaluation of reliability estimates, considering their influence on the downstream tasks of inferring protein function and learning regulatory networks from expression data. Results using this evaluation method show 1) our method for reliability estimation is an attractive alternative to those requiring a gold standard and 2) the new method for combining reliabilities is less sensitive to noise in reliability assignments than the similar existing technique.
https://doi.org/10.1142/9789812772435_0042
We describe a novel probabilistic approach to estimating errors in two-hybrid (2H) experiments. Such experiments are frequently used to elucidate protein-protein interaction networks in a high-throughput fashion; however, a significant challenge with these is their relatively high error rate, specifically, a high false-positive rate. We describe a comprehensive error model for 2H data, accounting for both random and systematic errors. The latter arise from limitations of the 2H experimental protocol: in theory, the reporting mechanism of a 2H experiment should be activated if and only if the two proteins being tested truly interact; in practice, even in the absence of a true interaction, it may be activated by some proteins – either by themselves or through promiscuous interaction with other proteins. We describe a probabilistic relational model that explicitly models the above phenomenon and use Markov Chain Monte Carlo (MCMC) algorithms to compute both the probability of an observed 2H interaction being true as well as the probability of individual proteins being self-activating/promiscuous. This is the first approach that explicitly models systematic errors in protein-protein interaction data; in contrast, previous work on this topic has modeled errors as being independent and random. By explicitly modeling the sources of noise in 2H systems, we find that we are better able to make use of the available experimental data. In comparison with Bader et al.'s method for estimating confidence in 2H predicted interactions, the proposed method performed 5-10% better overall, and in particular regimes improved prediction accuracy by as much as 76%.
https://doi.org/10.1142/9789812772435_0043
MALDI-based Imaging Mass Spectrometry (IMS) is an analytical technique that provides the opportunity to study the spatial distribution of biomolecules including proteins and peptides in organic tissue. IMS measures a large collection of mass spectra spread out over an organic tissue section and retains the absolute spatial location of these measurements for analysis and imaging. The classical approach to IMS imaging, producing univariate ion images, is not well suited as a first step in a prospective study where no a priori molecular target mass can be formulated. The main reasons for this are the size and the multivariate nature of IMS data. In this paper we describe the use of principal component analysis as a multivariate pre-analysis tool, to identify the major spatial and mass-related trends in the data and to guide further analysis downstream. First, a conceptual overview of principal component analysis for IMS is given. Then, we demonstrate the approach on an IMS data set collected from a transversal section of the spinal cord of a standard control rat.
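A compact sketch of PCA as a pre-analysis step on a hypothetical IMS cube, where each principal component yields a pseudo-spectrum (the loadings) and a spatial score image:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical IMS cube: a 40x30 grid of pixels, one 500-channel mass
# spectrum per pixel (real data sets are far larger).
ny, nx, nmz = 40, 30, 500
rng = np.random.default_rng(2)
cube = rng.poisson(5, size=(ny, nx, nmz)).astype(float)
cube[10:25, 5:15, 200] += 50          # a localized ion, i.e. a spatial trend

X = cube.reshape(ny * nx, nmz)        # pixels as observations
X -= X.mean(axis=0)                   # center each m/z channel
pca = PCA(n_components=5)
scores = pca.fit_transform(X)

# Each component gives (a) a pseudo-spectrum (pca.components_) and (b) a
# score image showing where that spectral trend occurs in the tissue.
score_image = scores[:, 0].reshape(ny, nx)
print(pca.explained_variance_ratio_[:3], score_image.shape)
```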
https://doi.org/10.1142/9789812772435_0044
No abstract received.
https://doi.org/10.1142/9789812772435_0045
We introduce a new motif-discovery algorithm, DIMDom, which exploits two kinds of information not commonly used: (a) the characteristic pattern of binding site classes, where class is determined based on biological information about transcription factor domains, and (b) posterior probabilities of these classes. We compared the performance of DIMDom with MEME on all the transcription factors of Drosophila with at least one known binding site in the TRANSFAC database and found that DIMDom outperformed MEME, with 2.5 times as many successes and 1.5 times the accuracy in finding binding sites and motifs.
https://doi.org/10.1142/9789812772435_0046
Transcription factors are DNA-binding proteins that control gene transcription by binding specific short DNA sequences. Experiments that identify transcription factor binding sites are often laborious and expensive, and the binding sites of many transcription factors remain unknown. We present a computational scheme to predict the binding sites directly from transcription factor sequence using all-atom molecular simulations. This method is a computational counterpart to recent high-throughput experimental technologies that identify transcription factor binding sites (ChIP-chip and protein-dsDNA binding microarrays). The only requirement of our method is an accurate 3D structural model of a transcription factor–DNA complex. We apply free energy calculations by thermodynamic integration to compute the change in binding energy of the complex due to a single base pair mutation. By calculating the binding free energy differences for all possible single mutations, we construct a position weight matrix for the predicted binding sites that can be directly compared with experimental data. As water-bridged hydrogen bonds between the transcription factor and DNA often contribute to the binding specificity, we include explicit solvent in our simulations. We present successful predictions for the yeast MAT-α2 homeodomain and GCN4 bZIP proteins. Water-bridged hydrogen bonds are found to be more prevalent than direct protein-DNA hydrogen bonds at the binding interfaces, indicating why empirical potentials with implicit water may be less successful in predicting binding. Our methodology can be applied to a variety of DNA-binding proteins.
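The final conversion from computed free-energy differences to a position weight matrix can be sketched with Boltzmann weighting; the ΔΔG values below are invented for illustration.

```python
import numpy as np

def pwm_from_ddG(ddG, kT=0.593):   # kT ≈ 0.593 kcal/mol at ~298 K
    """Convert per-position binding free-energy changes into a position
    weight matrix via Boltzmann weighting: p(base) ∝ exp(-ΔΔG / kT).
    ddG: array of shape (positions, 4), with 0.0 for the reference base."""
    w = np.exp(-np.asarray(ddG) / kT)
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical ΔΔG values (kcal/mol) for a 3-bp site, base order A,C,G,T;
# 0.0 marks the wild-type base, positive values are destabilizing.
ddG = [[0.0, 2.1, 1.5, 2.8],
       [1.9, 0.0, 2.5, 0.4],
       [2.2, 2.0, 0.0, 1.1]]
print(pwm_from_ddG(ddG).round(3))
```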
https://doi.org/10.1142/9789812772435_0047
Template-based comparative analysis is a viable approach to the prediction and annotation of pathways in genomes. Methods based solely on sequence similarity may not be effective enough; functional and structural information such as protein-DNA interactions and operons can prove useful in improving the prediction accuracy. In this paper, we present a novel approach to predicting pathways by seeking high overall sequence similarity, functional and structural consistency between the predicted pathways and their templates. In particular, the prediction problem is formulated into finding the maximum independent set (MIS) in the graph constructed based on operon or interaction structures as well as homologous relationships of the involved genes. On such graphs, the MIS problem is solved efficiently via non-trivial tree decomposition of the graphs. The developed algorithm is evaluated based on the annotation of 40 pathways in Escherichia coli (E. coli) K12 using those in Bacillus subtilis (B. subtilis) 168 as templates. It demonstrates overall accuracy that outperforms those of the methods based solely on sequence similarity or using structural information of the genome with integer programming.