The Pacific Symposium on Biocomputing (PSB) 2009 is an international, multidisciplinary conference for the presentation and discussion of current research in the theory and application of computational methods in problems of biological significance. Presentations are rigorously peer reviewed and are published in an archival proceedings volume. PSB 2009 will be held on January 5–9, 2009 in Kamuela, Hawaii. Tutorials will be offered prior to the start of the conference.
PSB 2009 will bring together top researchers from the US, the Asian Pacific nations, and around the world to exchange research results and address open issues in all aspects of computational biology. It is a forum for the presentation of work in databases, algorithms, interfaces, visualization, modeling, and other computational methods, as applied to biological problems, with emphasis on applications in data-rich areas of molecular biology.
The PSB has been designed to be responsive to the need for critical mass in sub-disciplines within biocomputing. For that reason, it is the only meeting whose sessions are defined dynamically each year in response to specific proposals. PSB sessions are organized by leaders of research in biocomputing's “hot topics.” In this way, the meeting provides an early forum for serious examination of emerging methods and approaches in this rapidly changing field.
https://doi.org/10.1142/9789812836939_fmatter
https://doi.org/10.1142/9789812836939_0001
https://doi.org/10.1142/9789812836939_0002
In order to better characterize the behavior of biochemical systems, it is sometimes helpful and necessary to introduce time-dependent input signals. If the state of a biochemical system with such signals is assumed to evolve deterministically and continuously, then it can be readily analyzed by solving ordinary differential equations. However, if it is assumed to evolve discretely and stochastically, then existing simulation methods cannot be applied. In this paper, we incorporate conditions for transient analysis into stochastic simulation and develop the corresponding simulation algorithm. Applying our method to examples, we demonstrate that it can yield new insights into the dynamics of biochemical systems; specifically, it can be used to verify the design of biochemical logic gates.
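For a piecewise-constant input signal, the standard stochastic simulation algorithm remains exact if the waiting time is capped at the next input switch and redrawn there (a consequence of memorylessness). The following is a minimal sketch on a hypothetical birth-death species, illustrative only and not the algorithm developed in the paper:

```python
import math
import random

def piecewise_ssa(x0, t_end, switch_times, u_levels,
                  k_prod=5.0, k_deg=0.1, seed=1):
    """Exact SSA for a toy birth-death species whose production rate
    is modulated by a piecewise-constant input u(t). u_levels must
    have one more entry than switch_times. Because rates are constant
    within each segment, capping the waiting time at the next switch
    and redrawing is exact."""
    rng = random.Random(seed)
    t, x = 0.0, x0
    traj = [(t, x)]
    switches = list(switch_times) + [float("inf")]  # segment end times
    seg = 0
    while t < t_end:
        u = u_levels[seg]
        a_prod = k_prod * u      # production propensity, input-modulated
        a_deg = k_deg * x        # first-order degradation propensity
        a_tot = a_prod + a_deg
        if a_tot == 0.0:
            # nothing can fire until the input switches
            t = min(switches[seg], t_end)
        else:
            tau = -math.log(1.0 - rng.random()) / a_tot
            if t + tau > switches[seg]:
                t = switches[seg]          # propensities change here; redraw
            else:
                t += tau
                x += 1 if rng.random() < a_prod / a_tot else -1
        if t >= switches[seg]:
            seg += 1
        traj.append((t, x))
    return traj
```

With a step input that is zero before the switch, no production can occur in the first segment, which makes the transient behavior easy to check by eye.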
https://doi.org/10.1142/9789812836939_0003
Gene networks are important tools for studying gene-gene relationships and gene function. Understanding the relationships within these networks is an important challenge, and ontologies are a critical tool for dealing with these data. The use of the Gene Ontology, for example, has become routine in methods for validation, discovery, and more. Here we present a novel algorithm that synthesizes an ontology by considering both extant annotation terms and the connections between genes in gene networks. The process is efficient and produces easily inspectable ontologies. Because the relationships drawn between terms are heavily influenced by data, we call these "Data-Driven" Ontologies. We apply this algorithm both to discover new relationships between biological processes and to compare sets of genes across microarray experiments. Supplemental data and source code are available at: http://www.ddont.org.
https://doi.org/10.1142/9789812836939_0004
There is a strong clinical imperative to identify discerning molecular biomarkers of disease to inform diagnosis, prognosis, and treatment. Ideally, such biomarkers would be drawn from peripheral sources non-invasively to reduce costs and lower potential for complication. Advances in high-throughput genomics and proteomics have vastly increased the space of prospective molecular biomarkers. Consequently, the elucidation of molecular biomarkers of clinical importance often entails a genome- or proteome-wide search for candidates. Here we present a novel framework for the identification of disease-specific protein biomarkers through the integration of biofluid proteomes and inter-disease genomic relationships using a network paradigm. We created a blood plasma biomarker network by linking expression-based genomic profiles from 136 diseases to 1,028 detectable blood plasma proteins. We also created a urine biomarker network by linking genomic profiles from 127 diseases to 577 proteins detectable in urine. Through analysis of these molecular biomarker networks, we find that the majority (> 80%) of putative protein biomarkers are linked to multiple disease conditions. Thus, prospective disease-specific protein biomarkers are found in only a small subset of the biofluid proteomes. These findings illustrate the importance of considering shared molecular pathology across diseases when evaluating biomarker specificity. The proposed framework is amenable to integration with complementary network models of biology, which could further constrain the biomarker candidate space, and establish a role for the understanding of multi-scale, inter-disease genomic relationships in biomarker discovery.
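The specificity analysis reduces to degree counting in a disease-protein bipartite graph: a protein linked to a single disease is a candidate specific biomarker, while a protein linked to many diseases is not. A toy sketch with hypothetical disease and protein names (the real networks link 136 diseases to 1,028 plasma proteins):

```python
from collections import defaultdict

# Hypothetical disease -> detectable-protein links
links = [
    ("diseaseA", "P1"), ("diseaseA", "P2"), ("diseaseB", "P2"),
    ("diseaseB", "P3"), ("diseaseC", "P2"), ("diseaseC", "P4"),
]

def specificity_profile(links):
    """Partition proteins by the number of diseases they are linked to:
    proteins linked to exactly one disease are candidate disease-specific
    biomarkers; the rest are shared across conditions."""
    diseases_of = defaultdict(set)
    for disease, protein in links:
        diseases_of[protein].add(disease)
    specific = {p for p, ds in diseases_of.items() if len(ds) == 1}
    shared = {p for p, ds in diseases_of.items() if len(ds) > 1}
    return specific, shared

specific, shared = specificity_profile(links)
# P2 is linked to three diseases, so it is a poor disease-specific marker
```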
https://doi.org/10.1142/9789812836939_0005
Modeling and analyzing protein-protein interaction (PPI) networks is an important problem in systems biology. Many random graph models have been proposed to capture specific network properties or to mimic the way real PPI networks might have evolved. In this paper we introduce a new generative model for PPI networks which is based on geometric random graphs and uses the full connectivity information of real PPI networks to learn their structure. Using only the high-confidence part of the yeast S. cerevisiae PPI network to train our new model, we successfully reproduce structural properties of other, lower-confidence yeast PPI networks, as well as of human PPI networks coming from different data sources. Thus, our new approach allows us to utilize high-quality parts of currently available PPI data to create accurate models for the PPI networks of different species.
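The underlying construction is the low-dimensional geometric random graph: points sampled uniformly in a unit cube, with an edge whenever two points lie within a distance threshold. A minimal sketch of that backbone with illustrative parameters (not fitted to any real PPI network, and without the learning step the paper adds):

```python
import math
import random

def geometric_random_graph(n, radius, dim=3, seed=0):
    """Basic geometric random graph: n points uniform in the unit
    cube of dimension `dim`, an edge whenever two points are within
    `radius` of each other."""
    rng = random.Random(seed)
    pts = [tuple(rng.random() for _ in range(dim)) for _ in range(n)]
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if math.dist(pts[i], pts[j]) <= radius}
    return pts, edges
```

At `radius` equal to the cube diagonal every pair is connected, and at radius zero no pair is, which brackets the density the threshold controls.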
https://doi.org/10.1142/9789812836939_0006
Analysis of condition-specific behavior under stressful environmental conditions can provide insight into mechanisms underlying different healthy and diseased cellular states. Functional networks (with edges representing statistical dependencies) inferred from condition-specific expression data can provide fine-grained, network-level information about conserved and specific behavior across different conditions. In this paper, we examine novel microarray compendia measuring gene expression from two unique stationary-phase yeast cell populations, quiescent and non-quiescent. We make the following contributions: (a) develop a new algorithm to infer functional networks modeled as undirected probabilistic graphical models, Markov random fields; (b) infer functional networks for quiescent, non-quiescent, and exponential cells; and (c) compare the inferred networks to identify processes common and different across these cells. We found that both non-quiescent and exponential cells have more gene ontology enrichment than quiescent cells. Exponential cells share more processes with non-quiescent than with quiescent cells, highlighting the novel and relatively under-studied characteristics of quiescent cells. Analysis of inferred subgraphs identified processes enriched in both quiescent and non-quiescent cells, as well as processes specific to each cell type. Finally, SNF1, which is crucial for quiescence, occurs exclusively among quiescent network hubs, while non-quiescent network hubs are enriched in homologs of human disease-causing genes.
https://doi.org/10.1142/9789812836939_0007
Bayesian network structure learning is a useful tool for elucidation of regulatory structures of biomolecular pathways. The approach, however, is limited by its acyclicity constraint, a problematic one in the cycle-containing biological domain. Here, we introduce a novel method for modeling cyclic pathways in biology by employing our newly introduced Generalized Bayesian Networks (GBNs). Our novel algorithm enables cyclic structure learning while employing biologically relevant data, as it extends our cycle-learning algorithm to permit learning with singly perturbed samples. We present theoretical arguments as well as structure learning results from realistic, simulated data of a biological system. We also present results from a real-world dataset involving signaling pathways in T-cells.
https://doi.org/10.1142/9789812836939_0008
Learning or inferring networks of genomic regulation specific to a cellular state, such as a subtype of tumor, can yield insight above and beyond that resulting from network-learning techniques which do not acknowledge the adaptive nature of the cellular system. In this study we show that Cellular Context Mining, which is based on a mathematical model of contextual genomic regulation, produces gene regulatory networks (GRNs) from steady-state expression microarray data which are specific to the varying cellular contexts hidden in the data; we show that these GRNs not only model gene interactions, but that they are also readily annotated with context-specific genomic information. We propose that these context-specific GRNs provide advantages over other techniques, such as clustering and Bayesian networks, when applied to gene expression data of cancer patients.
https://doi.org/10.1142/9789812836939_0009
Curated biological knowledge of interactions and pathways is widely available from various databases, and network synthesis is a popular method for gaining insight into the data. However, such data from curated databases present a single view of the knowledge to biologists, and that view may not suit researchers' specific needs. On the other hand, Medline abstracts are publicly accessible and encode the information necessary to synthesize different kinds of biological networks. In this paper, we propose a new paradigm for synthesizing biomolecular networks that allows biologists to create their own networks through queries to a specialized database of Medline abstracts. With this approach, users can specify precisely what kind of information they want in the resulting networks. We demonstrate the feasibility of our approach in the synthesis of gene-drug, gene-disease and protein-protein interaction networks. We show that our approach is capable of synthesizing these networks with high precision and even finds relations that have yet to be curated in public databases. In addition, we demonstrate a scenario of recovering a drug-related pathway using our approach.
https://doi.org/10.1142/9789812836939_0010
A number of tools for the alignment of protein-protein interaction (PPI) networks have laid the foundation for PPI network analysis. They typically find conserved interaction patterns by various local or global search algorithms, and then validate the results using genome annotation. Improving the speed, scalability and accuracy of network alignment is still the target of ongoing research. In view of this, we introduce a connected-components based algorithm, called HopeMap, for pairwise network alignment with a focus on fast identification of maximal conserved patterns across species. Observing that the number of true homologs across species is relatively small compared to the total number of proteins in all species, we start with highly homologous groups across species, find maximal conserved interaction patterns globally with a generic scoring system, and validate the results across multiple known functional annotations. The results are evaluated in terms of statistical enrichment of gene ontology (GO) terms and KEGG ortholog groups (KO) within conserved interaction patterns. HopeMap is fast, with linear computational cost, accurate in terms of KO group and GO term specificity and sensitivity, and extensible to multiple network alignment.
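The connected-components idea (project each network's interactions onto homolog groups, keep the interactions conserved in both networks, and take connected components of the result) can be sketched as follows. Inputs here are hypothetical, and HopeMap's scoring system and annotation-based validation are omitted:

```python
def conserved_components(net1, net2, group_of):
    """Connected components of the conserved-interaction graph: a pair
    of homolog groups is kept when some interaction in net1 and some
    interaction in net2 both project onto it. Simplified sketch only."""
    def project(net):
        return {frozenset((group_of[u], group_of[v]))
                for u, v in net
                if u in group_of and v in group_of
                and group_of[u] != group_of[v]}
    conserved = project(net1) & project(net2)

    # union-find over homolog groups joined by conserved interactions
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for e in conserved:
        a, b = tuple(e)
        parent[find(a)] = find(b)

    comps = {}
    for e in conserved:
        for g in e:
            comps.setdefault(find(g), set()).add(g)
    return list(comps.values())
```

Each returned component is a candidate maximal conserved interaction pattern, to be scored and validated downstream.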
https://doi.org/10.1142/9789812836939_0011
Comparative methods have long been a mainstay of biology, particularly evolutionary biology; they are also at the core of medical research based on animal models of human physiology. They find their most challenging and most fitting application, however, in the study of whole genomes, as they are the main tools through which we can make sense of the billions of base-pairs forming the sequence of animal and other genomes. Comparing whole genomes, which is necessarily done through computational methods due to the size of the genomes, has given rise to the research area known as comparative genomics…
https://doi.org/10.1142/9789812836939_0012
In this paper we use the length of the shared synteny between genes to identify "parent" orthologs among multiple lineage specific duplicated genes. Genes in the region around each duplicated paralog are compared with the genes flanking an outgroup ortholog to estimate the probability of observing homologs in syntenic vs. non-syntenic regions. The length of the shared synteny is introduced as a hidden variable and is estimated using Expectation-Maximization for each lineage specific paralog. Assuming that the original, parental gene will preserve the longest synteny with the outgroup gene, and that any daughter genes will have a shorter syntenic block, we are able to determine parent-daughter relationships. We apply this method to lineage specific duplications in the human genome, and show that we are able to determine the direction and size of the duplication events that have created hundreds of genes.
https://doi.org/10.1142/9789812836939_0013
Segmental duplications are abundant in the human genome, but their evolutionary history is not well-understood. The mystery surrounding them is due in part to their complex organization; many segmental duplications are mosaic patterns of smaller repeated segments, or duplicons. A two-step model of duplication has been proposed to explain these mosaic patterns. In this model, duplicons are copied and aggregated into primary duplication blocks that subsequently seed secondary duplications. Here, we formalize the problem of computing a duplication scenario that is consistent with the two-step model. We first describe a dynamic programming algorithm to compute the duplication distance between two strings. We then use this distance as the cost function in an integer linear program to obtain the most parsimonious duplication scenario. We apply our method to derive putative ancestral relationships between segmental duplications in the human genome.
https://doi.org/10.1142/9789812836939_0014
The "double-cut-and-join" (DCJ) model of genome rearrangement proposed by Yancopoulos et al. uses the single DCJ operation to account for all genome rearrangement events. Given three signed permutations, the DCJ median problem is to find a fourth permutation that minimizes the sum of the pairwise DCJ distances between it and the three others. In this paper, we present a branch-and-bound method that provides accurate solutions to the multichromosomal DCJ median problem. We conduct extensive simulations, and the results show that our DCJ median solver performs better than other median solvers for most of the test cases. These experiments also suggest that the DCJ model is more suitable for real datasets in which both reversals and transpositions occur.
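For circular, unichromosomal genomes the pairwise DCJ distance has a closed form: the number of genes minus the number of cycles in the adjacency graph. A small sketch under that restriction (the paper's solver handles the general multichromosomal case):

```python
def adjacencies(genome):
    """Extremity adjacencies of one signed, circular chromosome.
    A gene +g is traversed tail->head, -g head->tail."""
    ext = []
    for g in genome:
        if g > 0:
            ext += [(g, "t"), (g, "h")]
        else:
            ext += [(-g, "h"), (-g, "t")]
    return [frozenset((ext[i], ext[(i + 1) % len(ext)]))
            for i in range(1, len(ext), 2)]

def dcj_distance(g1, g2):
    """DCJ distance between two signed circular genomes over the same
    gene set: (number of genes) - (cycles in the adjacency graph)."""
    def partner(adjs):
        m = {}
        for a in adjs:
            x, y = tuple(a)
            m[x], m[y] = y, x
        return m
    m1, m2 = partner(adjacencies(g1)), partner(adjacencies(g2))
    seen, cycles = set(), 0
    for start in m1:
        if start in seen:
            continue
        cycles += 1
        x, use_first = start, True
        while x not in seen:            # alternate g1- and g2-adjacencies
            seen.add(x)
            x = (m1 if use_first else m2)[x]
            use_first = not use_first
    return len(g1) - cycles
```

A single reversal gives distance 1 and a transposition gives distance 2, consistent with DCJ charging transpositions two operations.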
https://doi.org/10.1142/9789812836939_0015
Genetic recombination plays two essential biological roles: it ensures the fidelity of the transmission of genetic information from one generation to the next, and it generates new combinations of genetic variants. Recombination is therefore a critical process in shaping the arrangement of polymorphisms within populations. "Recombination breakpoints" in a given set of genomes from individuals in a population divide the genome into haplotype blocks, resulting in a mosaic structure on the genome. In this paper, we study the Minimum Mosaic Problem: given a set of genome sequences from individuals within a population, compute a mosaic structure containing the minimum number of breakpoints. This mosaic structure provides a good estimate of the minimum number of recombination events (and their locations) required to generate the existing haplotypes in the population. We solve this problem by finding the shortest path in a directed graph. Our algorithm's efficiency permits genome-wide analysis.
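The shortest-path formulation can be sketched as a dynamic program over (position, donor) states, where staying with the current donor haplotype is free and switching donors costs one breakpoint. A toy version that returns only the minimum count, not breakpoint locations:

```python
def min_breakpoints(target, donors):
    """Minimum number of donor switches (breakpoints) needed to write
    `target` as a mosaic of the donor haplotypes, all of equal length.
    Equivalent to a shortest path over (position, donor) states."""
    INF = float("inf")
    # cost[j] = fewest breakpoints so far while ending on donor j
    cost = [0 if d[0] == target[0] else INF for d in donors]
    for i in range(1, len(target)):
        best = min(cost)
        cost = [INF if d[i] != target[i] else min(c, best + 1)
                for d, c in zip(donors, cost)]
    return min(cost)
```

For example, copying "0011" from donors "0000" and "1111" needs exactly one switch, while "0101" needs three.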
https://doi.org/10.1142/9789812836939_0016
Genomic intervals that contain a cluster of similar genes are of extreme biological interest, but difficult to sequence and analyze. One goal for interspecies comparisons of such intervals is to reconstruct a parsimonious series of duplications, deletions, and speciation events (a putative evolutionary history) that could have created the contemporary clusters from their last common ancestor. We describe a new method for reconstructing such an evolutionary scenario for a given set of intervals from present-day genomes, based on the statistical technique of Sequential Importance Sampling. An implementation of the method is evaluated (1) using artificial datasets generated by simulating the operations of duplication, deletion, and speciation starting with featureless "ancestral" sequences, and (2) by comparing the inferred evolutionary history of the amino-acid sequences for the CYP2 gene family from human chromosome 19, chimpanzee, orangutan, rhesus macaque, and dog against that computed by a standard phylogenetic-tree reconstruction method.
https://doi.org/10.1142/9789812836939_0017
https://doi.org/10.1142/9789812836939_0018
Understanding evolutionary dynamics from a systemic point of view crucially depends on knowledge of how evolution affects the size and structure of the organisms' functional building blocks (modules). It has recently been reported that statistics over sparse PPI graphlets can robustly monitor such evolutionary changes. However, there is abundant evidence that in PPI networks modules can be identified with highly interconnected (dense) and/or bipartite sub-graphs. We count such dense graphlets in PPI networks by employing recently developed search strategies that render the related inference problems tractable. We demonstrate that the corresponding counting statistics differ significantly between prokaryotes and eukaryotes, as well as between "real" PPI networks and scale-free network emulators. We also prove that another class of emulators, the low-dimensional geometric random graphs (GRGs), cannot contain a specific type of motif, complete bipartite graphs, which are abundant in PPI networks.
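As a concrete example of a dense bipartite motif count, K_{2,2} subgraphs (4-node bicliques) can be tallied from common-neighbor counts: each unordered node pair with c common neighbors contributes C(c, 2) bicliques, and each biclique is seen once from each of its two sides. A toy counter, not the search strategies used in the paper:

```python
from collections import defaultdict
from itertools import combinations

def count_k22(edges):
    """Count K_{2,2} subgraphs in an undirected graph via common
    neighbors; the sum counts every biclique once per side, hence the
    final division by two."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    twice = 0
    for u, v in combinations(sorted(adj), 2):
        c = len(adj[u] & adj[v])
        twice += c * (c - 1) // 2
    return twice // 2
```

A 4-cycle contains exactly one K_{2,2}, and a K_{2,3} contains three, which makes the counter easy to sanity-check.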
https://doi.org/10.1142/9789812836939_0019
Protein interaction network analyses have moved beyond simple topological observations to functional and evolutionary inferences based on the construction of putative ancestral networks. Evolutionary studies of protein interaction networks are generally derived from network comparisons, are limited in scope, or are theoretical dynamic models that are not contextualized to an organism's specific genes. A biologically faithful network-evolution reconstruction, which ties the evolution of the network itself to the actual genes of an organism, would help fill in the evolutionary gaps between the gene network "snapshots" of evolution we have from different species today. Here we present a novel framework for reverse engineering the evolution of protein interaction networks of extant species using phylogenetic gene trees and protein interaction data. We applied the framework to Saccharomyces cerevisiae data and present topological trends in the evolutionary lineage of yeast.
https://doi.org/10.1142/9789812836939_0020
Despite the rapid accumulation of systems-level biological data, understanding the dynamic nature of cellular activity remains a difficult task. The reason is that most biological data are static, or correspond only to snapshots of cellular activity. In this study, we explicitly attempt to disentangle the temporal complexity of biological networks by using compilations of time-series gene expression profiling data. We define a dynamic network module to be a set of proteins satisfying two conditions: (1) they form a connected component in the protein-protein interaction (PPI) network; and (2) their expression profiles form certain structures in the temporal domain. We develop an efficient mining algorithm to discover dynamic modules in a temporal network. Using yeast as a model system, we demonstrate that the majority of the identified dynamic modules are functionally homogeneous. Additionally, many of them provide insight into the sequential ordering of molecular events in cellular systems. Finally, we note that the applicability of our algorithm is not limited to the study of PPI networks; instead, it is generally applicable to the combination of any type of network and time-series data.
https://doi.org/10.1142/9789812836939_0021
Yeast two-hybrid (Y2H) protein-protein interaction (PPI) data suffer from high false-positive and false-negative rates, making the search for protein complexes in PPI networks a challenge. To overcome these limitations, we propose an efficient approach which measures connectivity between proteins not by edges, but by edge-disjoint paths. We model the number of edge-disjoint paths as a network flow and efficiently represent it in a Gomory-Hu tree. By manipulating the tree, we are able to isolate groups of nodes sharing more edge-disjoint paths with each other than with the rest of the network; these are our putative protein complexes. We examine the performance of our algorithm with the Variation of Information and Separation measures and show that it belongs to a group of techniques which are robust against increased false-positive and false-negative rates. We apply our approach to the yeast, mouse, worm, and human Y2H PPI networks, where it shows promising results. On the yeast network, we identify 38 statistically significant protein clusters, 20 of which correspond to protein complexes and 16 to functional modules.
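By Menger's theorem, the number of edge-disjoint paths between two proteins equals the unit-capacity max-flow between them. A minimal Edmonds-Karp sketch of that connectivity measure on a toy graph (the Gomory-Hu tree construction itself is omitted):

```python
from collections import defaultdict, deque

def edge_disjoint_paths(edges, s, t):
    """Number of edge-disjoint s-t paths in an undirected graph,
    computed as a unit-capacity max-flow with BFS augmentation."""
    if s == t:
        return 0
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v in edges:
        cap[(u, v)] += 1        # undirected edge: unit capacity each way
        cap[(v, u)] += 1
        adj[u].add(v)
        adj[v].add(u)
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        # augment by one unit along the path found
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1
```

Two nodes on opposite corners of a 4-cycle share two edge-disjoint paths even though no single edge joins them, which is exactly the robustness to missing edges the measure is meant to provide.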
https://doi.org/10.1142/9789812836939_0022
The aim of this paper is to demonstrate the potential power of large-scale particle filtering for parameter estimation of in silico biological pathways where time-course measurements of biochemical reactions are observable. Particle filtering has been a popular technique in statistical science; it approximates the posterior distributions of the parameters of a dynamic system using sequentially generated Monte Carlo samples. In order to apply particle filtering to the system identification of biological pathways, it is often necessary to explore posterior distributions defined over an exceedingly high-dimensional parameter space, and it is then essential to use a fairly large number of Monte Carlo samples to obtain an approximation with a high degree of accuracy. In this paper, we address some implementation issues in large-scale particle filtering and then indicate the importance of large-scale computing for parameter learning of in silico biological pathways. We have tested the ability of particle filtering with 10^8 Monte Carlo samples on the transcription circuit of the circadian clock, which contains 45 unknown kinetic parameters. The proposed approach clearly revealed the shape of the posterior distributions over the 45-dimensional parameter space.
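The sequential Monte Carlo machinery can be illustrated with a bootstrap (sequential importance resampling) filter on a one-dimensional toy model: a Gaussian random-walk state observed with Gaussian noise. The dynamics, noise levels, and particle counts here are illustrative, far from the 10^8-sample setting of the paper:

```python
import math
import random

def bootstrap_particle_filter(observations, n_particles=1000,
                              proc_sd=0.5, obs_sd=1.0, seed=0):
    """Bootstrap particle filter for a 1-D Gaussian random-walk state
    observed with Gaussian noise; returns the filtered mean at each
    observation time."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # propagate through the (assumed) random-walk dynamics
        particles = [x + rng.gauss(0.0, proc_sd) for x in particles]
        # weight by the Gaussian observation likelihood
        w = [math.exp(-0.5 * ((y - x) / obs_sd) ** 2) for x in particles]
        total = sum(w)
        if total == 0.0:                # degenerate weights: fall back
            w = [1.0] * n_particles
            total = float(n_particles)
        estimates.append(sum(x * wi for x, wi in zip(particles, w)) / total)
        # multinomial resampling
        particles = rng.choices(particles, weights=w, k=n_particles)
    return estimates
```

Fed a constant observation, the filtered mean converges toward it within a few dozen steps, the basic behavior the large-scale version exploits in far higher dimension.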
https://doi.org/10.1142/9789812836939_0023
Modeling in biology is mainly grounded in mathematics, specifically in ordinary differential equations (ODEs). The programming language approach is a complementary and emerging tool for analyzing the dynamics of biological networks. Here we focus on BlenX, showing how existing ODE models can easily be re-used within this framework. A budding yeast cell cycle example demonstrates the advantages of using a stochastic approach. Finally, some hints are provided on how the automatically translated model can take advantage of the full power of BlenX to analyze the control mechanisms of the cell cycle machinery.
https://doi.org/10.1142/9789812836939_0024
Some drugs affect the secretion of secreted proteins (e.g., cytokines) released from target cells, but it remains unclear whether these proteins act in an autocrine manner and directly affect the cells on which the drugs act. In this study, we propose a computational method for testing a biological hypothesis: there exist autocrine signaling pathways that are dynamically regulated by drug-response transcriptome networks and that control them simultaneously. If such pathways are identified, they could be useful for revealing drug mode-of-action and identifying novel drug targets. With the proposed node-set separation method, dynamic structural changes can be embedded in transcriptome networks, enabling us to find master-regulator genes or critical paths at each observed time. We then combine the protein-protein interaction network with the estimated dynamic transcriptome network to discover drug-affected autocrine pathways, if they exist. The statistical significance (p-values) of the pathways is evaluated by a meta-analysis technique. The dynamics of the interactions between the transcriptome networks and the signaling pathways are shown in this framework. We illustrate our strategy with an application to the anti-hyperlipidemia drug Fenofibrate. From over one million protein-protein interaction pathways, we extracted 23 significant autocrine-like pathways under the Bonferroni correction, including VEGF–NRP1–GIPC1–PRKCA–PPARα, one of the most significant, which contains PPARα, a target of Fenofibrate.
https://doi.org/10.1142/9789812836939_0025
A key role of signal transduction pathways is to control transcriptional programs in the nucleus as a function of signals received by the cell via complex post-translational modification cascades. This determines cell-context-specific responses to environmental stimuli. Given the difficulty of quantitating protein concentration and post-translational modifications, signaling pathway studies are still for the most part conducted one interaction at a time. Thus, genome-wide, cell-context-specific dissection of signaling pathways is still an open challenge in molecular systems biology.
In this manuscript we extend the MINDy algorithm for the identification of post-translational modulators of transcription factor activity, to produce a first genome-wide map of the interface between signaling and transcriptional regulatory programs in human B cells. We show that the serine-threonine kinase STK38 emerges as the most pleiotropic signaling protein in this cellular context and we biochemically validate this finding by shRNA-mediated silencing of this kinase, followed by gene expression profile analysis. We also extensively validate the inferred interactions using protein-protein interaction databases and the kinase-substrate interaction prediction algorithm NetworKIN.
https://doi.org/10.1142/9789812836939_0026
https://doi.org/10.1142/9789812836939_0027
This article describes a numerical solution of the steady-state Poisson-Boltzmann-Smoluchowski (PBS) and Poisson-Nernst-Planck (PNP) equations to study diffusion in biomolecular systems. Specifically, finite element methods have been developed to calculate electrostatic interactions and ligand binding rate constants for large biomolecules. The resulting software has been validated and applied to the wild-type and several mutated avian influenza neuraminidase crystal structures. The calculated rates show very good agreement with recent experimental studies. Furthermore, these finite element methods require significantly fewer computational resources than existing particle-based Brownian dynamics methods and are robust for complicated geometries. The key finding of biological importance is that electrostatic steering plays an important role in the drug binding process of neuraminidase.
https://doi.org/10.1142/9789812836939_0028
The overall goal of this study was to assess the mechanistic fidelity of continuum-level finite element models of the vertebral body, which represent a promising tool for understanding and predicting clinical fracture risk. Two finite element (FE) models were generated from micro-CT scans of each of 13 T9 vertebral bodies — a micro-FE model at 60-micron resolution and a coarsened, continuum-level model at 0.96-mm resolution. Two previously reported continuum-level modulus-density relationships for human vertebral bone were parametrically varied to investigate their effects on model fidelity, using the micro-FE models as the gold standard. We found that the modulus-density relation, particularly that assigned to the peripheral bone, substantially altered the regression coefficients, but not the degree of correlation, between continuum and micro-FE predictions of whole-vertebral stiffness. The major load paths through the vertebrae compared well between the continuum-level and micro-FE models (von Mises stress distributions), but the distributions of minimum principal strain were notably different. We conclude that continuum-level models provide robust measures of whole-vertebral behavior and describe well the load transfer paths through the vertebra, but provide strain distributions that are markedly different from the volume-averaged micro-scale strains. Appreciation of these multi-scale differences should improve interpretation of results from such continuum models and may improve their clinical utility.
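Modulus-density relationships of the kind varied in such studies are typically power laws, E = a * rho^b, applied element by element to map apparent density to an assigned modulus. A sketch of that mapping with placeholder coefficients (not the two published relations evaluated in the study):

```python
def modulus(rho, a, b):
    """Power-law modulus-density relation E = a * rho**b, the general
    form of continuum-level relations for vertebral bone. Units are
    assumed to be MPa and g/cm^3; coefficients are illustrative only."""
    return a * rho ** b

# hypothetical element apparent densities from a coarsened scan
densities = [0.10, 0.15, 0.25, 0.40]
relation_1 = [modulus(r, a=8920.0, b=1.83) for r in densities]
relation_2 = [modulus(r, a=4730.0, b=1.56) for r in densities]
```

Varying (a, b) rescales and reshapes the assigned modulus field, which is why it shifts regression coefficients against the micro-FE predictions without necessarily changing how well the two are correlated.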
https://doi.org/10.1142/9789812836939_0029
As a case-study of biosimulation model integration, we describe our experiences applying the SemSim methodology to integrate independently-developed, multiscale models of cardiac circulation. In particular, we have integrated the CircAdapt model (written by T. Arts for MATLAB) of an adapting vascular segment with a cardiovascular system model (written by M. Neal for JSim). We report on three results from the model integration experience. First, models should be explicit about simulations that occur on different time scales. Second, data structures and naming conventions used to represent model variables may not translate across simulation languages. Finally, identifying the dependencies among model variables is a non-trivial task. We claim that these challenges will appear whenever researchers attempt to integrate models from others, especially when those models are written in a procedural style (using MATLAB, Fortran, etc.) rather than a declarative format (as supported by languages like SBML, CellML or JSim's MML).
https://doi.org/10.1142/9789812836939_0030
Multiscale modeling has emerged as a powerful approach to interpret and capitalize on the biological complexity underlying blood vessel growth. We present a multiscale model of angiogenesis that heralds the start of a large scale initiative to integrate related biological models. The goal of the integrative project is to better understand underlying biological mechanisms from the molecular level up through the organ systems level, and test new therapeutic strategies. Model methodology includes ordinary and partial differential equations, stochastic models, complex logical rules, and agent-based architectures. Current modules represent blood flow, oxygen transport, growth factor distribution and signaling, cell sensing, cell movement and cell proliferation. Challenges of integration lie in connecting modules that are diversely designed, seamlessly coordinating feedback, and representing spatial and time scales from ligand-receptor interactions and intracellular signaling, to cell-level movement and cell-matrix interactions, to vessel branching and capillary network formation, to tissue level characteristics, to organ system response. We briefly introduce the individual modules, discuss our approach to integration, present initial results from the coordination of modules, and propose solutions to some critical issues facing angiogenesis multiscale modeling and integration.
https://doi.org/10.1142/9789812836939_0031
Computational models of excitation-contraction (EC) coupling in myocytes are valuable tools for studying the signaling cascade that transduces transmembrane voltage into mechanical responses. A key component of these models is the appropriate description of structures involved in EC coupling, such as the sarcolemma and ion channels. This study aims at developing an approach for spatial reconstruction of these structures. We exemplified our approach by reconstructing clusters of ryanodine receptors (RyRs) together with the sarcolemma of rabbit ventricular myocytes. The reconstructions were based on dual labeling and three-dimensional (3D) confocal imaging of segments of fixed and permeabilized myocytes lying flat or on end. The imaging led to 3D stacks of cross-sections through myocytes. Methods of digital image processing were applied to deconvolve, filter and segment these stacks. Finally, we created point meshes representing RyR distributions together with volume and surface meshes of the sarcolemma. We suggest that these meshes are suitable for computational studies of structure-function relationships in EC coupling. We propose that this approach can be extended to reconstruct other structures and proteins involved in EC coupling.
https://doi.org/10.1142/9789812836939_0032
We present a new multiscale method that combines all-atom molecular dynamics with coarse-grained sampling, towards the aim of bridging two levels of physiology: the atomic scale of protein side chains and small molecules, and the huge scale of macromolecular complexes like the ribosome. Our approach uses all-atom simulations of peptide (or other ligand) fragments to calculate local 3D spatial potentials of mean force (PMF). The individual fragment PMFs are then used as a potential for a coarse-grained chain representation of the entire molecule. Conformational space and sequence space are sampled efficiently using generalized ensemble Monte Carlo. Here, we apply this method to the study of nascent polypeptides inside the cavity of the ribosome exit tunnel. We show how the method can be used to explore the accessible conformational and sequence space of nascent polypeptide chains near the ribosome peptidyl transfer center (PTC), with the eventual aim of understanding the basis of specificity for co-translational regulation. The method has many potential applications to predicting binding specificity and design, and is sufficiently general to allow even greater separation of scales in future work.
https://doi.org/10.1142/9789812836939_0033
The following sections are included:
https://doi.org/10.1142/9789812836939_0034
Stem cells represent not only a potential source of treatment for degenerative diseases but can also shed light on developmental biology and cancer. It is believed that stem cell differentiation and fate are triggered by a common genetic program that endows these cells with the ability to differentiate into specialized progenitors and fully differentiated cells. To extract the stemness signature of several cell types at the transcription level, we integrated heterogeneous datasets (microarray experiments) performed in different adult and embryonic tissues (liver, blood, bone, prostate and stomach in Homo sapiens and Mus musculus). Data were integrated by generalization of the hematopoietic stem cell hierarchy and by homology between mouse and human. The variation-filtered and integrated gene expression dataset was fed to a single-layered neural network to create a classifier to (i) extract the stemness signature and (ii) characterize unknown stem cell tissue samples by attribution of a stem cell differentiation stage. We were able to characterize mouse stomach progenitor and human prostate progenitor samples and to isolate gene signatures playing a fundamental role at every level of the generalized stem cell hierarchy.
https://doi.org/10.1142/9789812836939_0035
Genome-wide association studies provide an unprecedented opportunity to identify combinations of genetic variants that contribute to disease susceptibility. The combinatorial problem of jointly analyzing the millions of genetic variations accessible by high-throughput genotyping technologies is a difficult challenge. One approach to reducing the search space of this variable selection problem is to assess specific combinations of genetic variations based on prior statistical and biological knowledge. In this work, we provide a systematic approach to integrate multiple public databases of gene groupings and sets of disease-related genes to produce multi-SNP models that have an established biological foundation. This approach yields a collection of models which can be tested statistically in genome-wide data, along with an ordinal quantity describing the number of data sources that support any given model. Using this knowledge-driven approach reduces the computational and statistical burden of large-scale interaction analysis while simultaneously providing a biological foundation for the relevance of any significant statistical result that is found.
https://doi.org/10.1142/9789812836939_0036
Several influential studies of genotypic determinants of gene expression in humans have now been published based on various populations including HapMap cohorts. The magnitude of the analytic task (transcriptome vs. SNP-genome) is a hindrance to dissemination of efficient, thorough, and auditable inference methods for this project. We describe the structure and use of Bioconductor facilities for inference in genetics of gene expression, with simultaneous application to multiple HapMap cohorts. Tools distributed for this purpose are readily adapted for the structure and analysis of privately-generated data in expression genetics.
https://doi.org/10.1142/9789812836939_0037
Research in model organisms relies on unspoken assumptions about the conservation of protein-protein interactions across species, yet several analyses suggest such conservation is limited. Fortunately, for many purposes the crucial issue is not global conservation of interactions, but preferential conservation of functionally important ones. An observed bias towards essentiality in highly-connected proteins implies the functional importance of such "hubs". We therefore define the notion of degree-conservation and demonstrate that hubs are preferentially degree-conserved. We show that a protein is more likely to be a hub if it has a high-degree ortholog, and that once a protein becomes a hub, it tends to remain so. We also identify a positive correlation between the average degree of a protein and the conservation of its interaction partners, and we find that the conservation of individual hub interactions is surprisingly high. Our work has important implications for prediction of protein function, computational inference of PPIs, and interpretation of data from model organisms.
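The notion of degree-conservation described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the edge lists, the `ortholog` mapping, and the function name are assumptions. The idea is to pair each protein's degree in one network with its ortholog's degree in the other, so the two can then be correlated.

```python
from collections import Counter

def degree_pairs(edges_a, edges_b, ortholog):
    """For each protein in network A with a known ortholog in network B,
    return (degree in A, ortholog's degree in B) pairs. Correlating these
    pairs gives one simple measure of degree-conservation."""
    deg_a = Counter(p for edge in edges_a for p in edge)
    deg_b = Counter(p for edge in edges_b for p in edge)
    return [(deg_a[p], deg_b[q]) for p, q in ortholog.items()]
```

A hub in network A whose ortholog is also highly connected in network B contributes a high-high pair; preferential degree-conservation of hubs would show up as a strong positive correlation among the high-degree pairs.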
https://doi.org/10.1142/9789812836939_0038
This paper describes the design and applications of Unison, a comprehensive and integrated warehouse of protein sequences, diverse precomputed predictions, and other biological data. Unison provides a practical solution to the burden of preparing data for computational discovery projects, enables holistic feature-based mining queries regarding protein composition and functions, and provides a foundation for the development of new tools. Unison is available for immediate use online via direct database connections and a web interface. In addition, the database schema, command line tools, web interface, and non-proprietary precomputed predictions are released under the Academic Free License and available for download at http://unison-db.org/. This project has resulted in a system that significantly reduces several practical impediments to the initiation of computational biology discovery projects.
https://doi.org/10.1142/9789812836939_0039
The goal of genome-wide association (GWA) mapping in modern genetics is to identify genes or narrow regions in the genome that contribute to genetically complex phenotypes such as morphology or disease. Among the existing methods, tree-based association mapping methods show obvious advantages over single marker-based and haplotype-based methods because they incorporate information about the evolutionary history of the genome into the analysis. However, existing tree-based methods are designed primarily for binary phenotypes derived from case/control studies or fail to scale genome-wide.
In this paper, we introduce TreeQA, a quantitative GWA mapping algorithm. TreeQA utilizes local perfect phylogenies constructed in genomic regions exhibiting no evidence of historical recombination. Through efficient algorithm design and implementation, TreeQA can conduct quantitative genome-wide association analysis efficiently and is more effective than previous methods. We conducted extensive experiments on both simulated datasets and mouse inbred lines to demonstrate the efficiency and effectiveness of TreeQA.
https://doi.org/10.1142/9789812836939_0040
Identifying and validating biomarkers from high-throughput gene expression data is important for understanding and treating cancer. Typically, we identify candidate biomarkers as features that are differentially expressed between two or more classes of samples. Many feature selection metrics rely on ranking by some measure of differential expression. However, interpreting these results is difficult due to the large variety of existing algorithms and metrics, each of which may produce different results. Consequently, a feature ranking metric may work well on some datasets but perform considerably worse on others. We propose a method to choose an optimal feature ranking metric on an individual dataset basis. A metric is optimal if, for a particular dataset, it favorably ranks features that are known to be relevant biomarkers. Extensive knowledge of biomarker candidates is available in public databases and literature. Using this knowledge, we can choose a ranking metric that produces the most biologically meaningful results. In this paper, we first describe a framework for assessing the ability of a ranking metric to detect known relevant biomarkers. We then apply this method to clinical renal cancer microarray data to choose an optimal metric and identify several candidate biomarkers.
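A minimal sketch of the kind of assessment described above: score a feature-ranking metric by how many known biomarkers it places in its top k. The function name, inputs, and scoring rule are illustrative assumptions, not the paper's exact framework.

```python
def knowledge_recall(ranked_features, known_biomarkers, top_k=100):
    """Fraction of known relevant biomarkers that a feature-ranking
    metric places within its top_k features. Comparing this score
    across metrics on the same dataset suggests which metric yields
    the most biologically meaningful ranking for that dataset."""
    top = set(ranked_features[:top_k])
    return len(top & set(known_biomarkers)) / len(known_biomarkers)
```

Running this for each candidate metric on the same dataset, with the known-biomarker set drawn from public databases and literature, selects the metric that best recovers prior knowledge.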
https://doi.org/10.1142/9789812836939_0041
The immune system of higher organisms is, by any standard, complex. To date, using reductionist techniques, immunologists have elucidated many of the basic principles of how the immune system functions, yet our understanding is still far from complete. In an era of high throughput measurements, it is already clear that the scientific knowledge we have accumulated has itself grown larger than our ability to cope with it, and thus it is increasingly important to develop bioinformatics tools with which to navigate the complexity of the information that is available to us. Here, we describe ImmuneXpresso, an information extraction system, tailored for parsing the primary literature of immunology and relating it to experimental data. The immune system is very much dependent on the interactions of various white blood cells with each other, either in synaptic contacts, at a distance using cytokines or chemokines, or both. Therefore, as a first approximation, we used ImmuneXpresso to create a literature derived network of interactions between cells and cytokines. Integration of cell-specific gene expression data facilitates cross-validation of cytokine mediated cell-cell interactions and suggests novel interactions. We evaluate the performance of our automatically generated multi-scale model against existing manually curated data, and show how this system can be used to guide experimentalists in interpreting multi-scale, experimental data. Our methodology is scalable and can be generalized to other systems.
https://doi.org/10.1142/9789812836939_0042
High-throughput (HTP) technologies offer the capability to evaluate the genome, proteome, and metabolome of an organism at a global scale. This opens up new opportunities to define complex signatures of disease that involve signals from multiple types of biomolecules. However, integrating these data types is difficult due to the heterogeneity of the data. We present a Bayesian approach to integration that uses posterior probabilities to assign class memberships to samples using individual and multiple data sources; these probabilities are based on lower-level likelihood functions derived from standard statistical learning algorithms. We demonstrate this approach on microbial infections of mice, where the bronchial alveolar lavage fluid was analyzed by three HTP technologies, two proteomic and one metabolomic. We demonstrate that integration of the three datasets improves classification accuracy to ~89% from the best individual dataset at ~83%. In addition, we present a new visualization tool called Visual Integration for Bayesian Evaluation (VIBE) that allows the user to observe classification accuracies at the class level and evaluate classification accuracies on any subset of available data types based on the posterior probability models defined for the individual and integrated data.
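The core integration step, combining per-source likelihoods into a class posterior, can be sketched as below, assuming the sources are treated as conditionally independent given the class (a naive-Bayes-style assumption). The names and data layout are illustrative, not VIBE's actual interface.

```python
def integrate_posteriors(priors, source_likelihoods):
    """Combine class likelihoods from several data sources into a single
    posterior over classes: posterior(c) is proportional to
    prior(c) * product over sources s of likelihood_s(c).
    `priors` maps class -> prior probability; `source_likelihoods` is a
    list of dicts, one per data source, mapping class -> likelihood."""
    post = {}
    for c, prior in priors.items():
        p = prior
        for lik in source_likelihoods:
            p *= lik[c]
        post[c] = p
    z = sum(post.values())  # normalize so posteriors sum to 1
    return {c: p / z for c, p in post.items()}
```

Classifying on any subset of data types then amounts to passing only that subset's likelihood dicts, which is the kind of subset evaluation the abstract describes.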
https://doi.org/10.1142/9789812836939_0043
The following sections are included:
https://doi.org/10.1142/9789812836939_0044
The identification of genes acting synergistically as master regulators of physiologic and pathologic cellular phenotypes is a key open problem in systems biology. Here we use a molecular interaction based approach to identify the repertoire of transcription factors (TFs) of a master regulatory module responsible for synergistic activation of a tumor-specific signature. Specifically, we used the ARACNe algorithm and other computational tools to infer regulatory interactions responsible for initiating and maintaining the mesenchymal phenotype of Glioblastoma Multiforme (GBM), previously associated with the poorest disease prognosis. Expression of mesenchymal genes is a hallmark of aggressiveness, but the upstream regulators of the signature are unknown. Starting from an unbiased analysis of all TFs, we identify a highly interconnected module of six TFs jointly regulating >75% of the genes in the signature. Two TFs (Stat3 and C/EBPb), in particular, display features of initiators and master regulators of module activity. Biochemical validation confirms that the TFs in the module bind to the inferred promoters in vivo, and ectopic expression of the master TFs activates expression of the mesenchymal signature. These effects are sufficient to trigger mesenchymal transformation of neural stem cells, which become highly tumorigenic in vivo, and promote migration and invasion. Conversely, silencing of Stat3 and C/EBPb in human glioma cells leads to collapse of the mesenchymal signature and reduction of tumor aggressiveness. Our results reveal that activation of a small transcriptional module is necessary and sufficient to induce a mesenchymal phenotype in malignant brain tumors.
https://doi.org/10.1142/9789812836939_0045
Motivation: For different tumour types, extended knowledge about the molecular mechanisms involved in tumorigenesis is lacking. Looking for copy number variations (CNV) by Comparative Genomic Hybridization (CGH) can, however, help to determine key elements in this tumorigenesis. As genome-wide array CGH gives the opportunity to evaluate CNV at high resolution, it yields huge amounts of data, necessitating adequate mathematical methods to carefully select and interpret these data.
Results: Two groups of patients differing in cancer subtype were defined in two publicly available array CGH data sets as well as in our own data set on ovarian cancer. Chromosomal regions characterizing each group of patients were gathered using recurrent hidden Markov Models (HMM). The differential regions were reduced to a subset of features for classification by integrating different univariate feature selection methods. Weighted Least Squares Support Vector Machines (LS-SVM), a supervised classification method that accounts for unbalanced data sets, resulted in leave-one-out or 10-fold cross-validation accuracies ranging from 88% to 95.5%.
Conclusion: The combination of recurrent HMMs for the detection of copy number alterations with LS-SVM classifiers offers a novel methodological approach for classification based on copy number alterations. Additionally, this approach limits the chromosomal regions that are necessary to classify patients according to cancer subtype.
https://doi.org/10.1142/9789812836939_0046
Motivation: We present a probabilistic model called a Joint Intervention Network (JIN) for inferring interactions among a chosen set of regulator genes. The inputs to the method are expression changes of downstream indicator genes observed under knock-out of the regulators. JIN can use any number of perturbation combinations for model inference (e.g. single, double, and triple knock-outs). Results/Conclusions: We applied JIN to a Vibrio cholerae regulatory network to uncover mechanisms critical to its environmental persistence. V. cholerae is a facultative human pathogen that causes cholera in humans and is responsible for seven pandemics. We analyzed the expression response of 17 V. cholerae biofilm indicator genes under various single and multiple knock-outs of three known biofilm regulators. Using the inferred network, we were able to identify new genes involved in biofilm formation more accurately than by clustering expression profiles.
https://doi.org/10.1142/9789812836939_0047
Influenza hemagglutinin mediates both cell-surface binding and cell entry by the virus. Mutations to hemagglutinin are thus critical in determining host species specificity and viral infectivity. Previous approaches have primarily considered point mutations and sequence conservation; here we develop a complementary approach using mutual information to examine concerted mutations. For hemagglutinin, several overlapping selective pressures can cause such concerted mutations, including the host immune response, ligand recognition and host specificity, and functional requirements for pH-induced activation and membrane fusion. Using sequence mutual information as a metric, we extracted clusters of concerted mutation sites and analyzed them in the context of crystallographic data. Comparison of influenza isolates from two subtypes—human H3N2 strains and human and avian H5N1 strains—yielded substantial differences in spatial localization of the clustered residues. We hypothesize that the clusters on the globular head of H3N2 hemagglutinin may relate to antibody recognition (as many protective antibodies are known to bind in that region), while the clusters in common to H3N2 and H5N1 hemagglutinin may indicate shared functional roles. We propose that these shared sites may be particularly fruitful for mutagenesis studies in understanding the infectivity of this common human pathogen. The combination of sequence mutual information and structural analysis thus helps generate novel functional hypotheses that would not be apparent via either method alone.
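The sequence mutual-information metric can be sketched as follows for a pair of alignment columns. This is a generic implementation of column-pair MI with no pseudocounts or finite-sample correction, not necessarily the authors' exact estimator.

```python
import math
from collections import Counter

def column_mi(col_a, col_b):
    """Mutual information (in bits) between two alignment columns, each
    given as a string with one residue per sequence. High MI between a
    pair of columns signals concerted mutation across the alignment."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in pab.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), with counts over n sequences
        mi += (n_ab / n) * math.log2(n_ab * n / (pa[a] * pb[b]))
    return mi
```

Computing this statistic for all column pairs and clustering the high-MI pairs yields candidate groups of concertedly mutating sites, which can then be mapped onto crystallographic structures as the abstract describes.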
https://doi.org/10.1142/9789812836939_0048
Computational identification of prognostic biomarkers capable of withstanding follow-up validation efforts is still an open challenge in cancer research. For instance, several gene expression profiles analysis methods have been developed to identify gene signatures that can classify cancer sub-phenotypes associated with poor prognosis. However, signatures originating from independent studies show only minimal overlap and perform poorly when classifying datasets other than the ones they were generated from. In this paper, we propose a computational systems biology approach that can infer robust prognostic markers by identifying upstream Master Regulators, causally related to the presentation of the phenotype of interest. Such a strategy effectively extends and complements other existing methods and may help further elucidate the molecular mechanisms of the observed pathophysiological phenotype. Results show that inferred regulators substantially outperform canonical gene signatures both on the original dataset and across distinct datasets.
https://doi.org/10.1142/9789812836939_0049
Human immunodeficiency virus-1 (HIV-1) in acquired immune deficiency syndrome (AIDS) relies on human host cell proteins in virtually every aspect of its life cycle. Knowledge of the set of interacting human and viral proteins would greatly contribute to our understanding of the mechanisms of infection and subsequently to the design of new therapeutic approaches. This work is the first attempt to predict the global set of interactions between HIV-1 and human host cellular proteins. We propose a supervised learning framework in which multiple information data sources are utilized, including co-occurrence of functional motifs and their interaction domains and protein classes, gene ontology annotations, posttranslational modifications, tissue distributions and gene expression profiles, topological properties of the human protein in the interaction network, and the similarity of HIV-1 proteins to human proteins' known binding partners. We trained and tested a Random Forest (RF) classifier with this extensive feature set. The model's predictions achieved an average Mean Average Precision (MAP) score of 23%. Among the predicted interactions was, for example, the pair of HIV-1 protein tat and the human vitamin D receptor; this interaction had recently been independently validated experimentally. The rank-ordered lists of predicted interacting pairs are a rich source for generating biological hypotheses. Amongst the novel predictions, transcription regulator activity, immune system process, and macromolecular complex were the most significant molecular function, biological process, and cellular compartment, respectively. Supplementary material is available at URL http://www.cs.cmu.edu/~oznur/hiv/hivPPI.html.
https://doi.org/10.1142/9789812836939_0050
Recent advances in high-throughput genotyping have inspired increasing research interest in genome-wide association studies for diseases. To understand the underlying biological mechanisms of many diseases, we need to consider the genetic effects across multiple loci simultaneously. The large number of SNPs often makes multilocus association study very computationally challenging because it needs to explicitly enumerate all possible SNP combinations at the genome-wide scale. Moreover, because many SNPs are correlated, a permutation procedure is often needed for properly controlling family-wise error rates. This makes the problem even more computationally demanding, since the test procedure needs to be repeated for each permuted dataset. In this paper, we present FastChi, an exhaustive yet efficient algorithm for the genome-wide two-locus chi-square test. FastChi utilizes an upper bound of the two-locus chi-square test, which can be expressed as the sum of two terms – both efficient to compute: the first term is based on the single-locus chi-square test for the given phenotype, and the second term depends only on the genotypes and is independent of the phenotype. This upper bound enables the algorithm to perform the two-locus chi-square test on only a small number of candidate SNP pairs without the risk of missing any significant ones. Since the second part of the upper bound needs to be precomputed only once and stored for subsequent uses, the advantage is more prominent in large permutation tests. Extensive experimental results demonstrate that our method is an order of magnitude faster than the brute force alternative.
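The statistic being bounded can be sketched as a standard chi-square test on the phenotype-by-joint-genotype contingency table; the pruning bound itself is FastChi's contribution and is not reproduced here. The function name and the binary genotype encoding are illustrative assumptions.

```python
from collections import Counter

def two_locus_chi2(pheno, snp1, snp2):
    """Chi-square statistic for association between a binary phenotype and
    the joint genotype at two loci. In an exhaustive scan this test is run
    for every SNP pair; FastChi's strategy is to skip pairs whose upper
    bound already falls below the significance threshold."""
    n = len(pheno)
    obs = Counter(zip(pheno, snp1, snp2))
    rows = Counter(pheno)            # phenotype margins
    cols = Counter(zip(snp1, snp2))  # joint-genotype margins
    chi2 = 0.0
    for p, n_p in rows.items():
        for g, n_g in cols.items():
            expected = n_p * n_g / n  # expected count under independence
            chi2 += (obs.get((p, *g), 0) - expected) ** 2 / expected
    return chi2
```

A permutation test shuffles `pheno` and reruns the scan; since the phenotype-independent part of FastChi's bound is unchanged by the shuffle, it can be precomputed once, which is why the speedup grows with the number of permutations.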
https://doi.org/10.1142/9789812836939_0051
Open Science is gathering pace both as a grass roots effort amongst scientists to enable them to share the outputs of their research more effectively, and as a policy initiative for research funders to gain a greater return on their investment. In this workshop, we will discuss the current state of the art in collaborative research tools, the social challenges facing those adopting and advocating more openness, and the development of standards, policies and best practices for Open Science.
https://doi.org/10.1142/9789812836939_0052
The following sections are included: