This volume contains about 40 papers covering many of the latest developments in the fast-growing field of bioinformatics. The contributions span a wide range of topics, including computational genomics and genetics, protein function and computational proteomics, the transcriptome, structural bioinformatics, microarray data analysis, motif identification, biological pathways and systems, and biomedical applications. There are also abstracts from the keynote addresses and invited talks.
The papers cover not only theoretical aspects of bioinformatics but also delve into the application of new methods, with input from the computational, engineering, and biological disciplines. This multidisciplinary approach gives these proceedings a unique perspective on the field.
https://doi.org/10.1142/9781860947575_fmatter
COMMITTEES.
REFEREES.
PREFACE.
CONTENTS.
https://doi.org/10.1142/9781860947575_0001
No abstract received.
https://doi.org/10.1142/9781860947575_0002
For the past decade, there has been increasing interest in ontologies in the biomedical community. As interest has peaked, so has the confusion. The confusion stems from the multiple knowledge-representation languages used to encode ontologies (e.g., frame-based systems, Semantic Web standards such as RDF(S) and OWL, and languages created specifically by the bioinformatics community, such as OBO), where each language has distinct strengths and weaknesses. Biomedical scientists use ontologies for multiple purposes, from annotation of experimental data, to natural-language processing, to data integration, to construction of decision-support systems. Each of these purposes imposes different requirements concerning which entities ontologies should encode and how those entities should be encoded. Although the biomedical informatics community remains excited about ontologies, exactly what an ontology is and how it should be represented within a computer are points about which, despite considerable questioning, we can find little uniformity of opinion. The confusion will persist until we understand that different developers have very different requirements for ontologies, and that those developers will therefore make very different assumptions about how ontologies should be created and structured. We will review those assumptions and the corresponding implications for ontology construction.
Our National Center for Biomedical Ontology (http://bioontology.org) is one of the seven national centers for biomedical computing formed under the NIH Roadmap. The Center takes a broad perspective on what ontologies are and how they should be developed and put to use. Our goal, simply put, is to help to eliminate much of the current confusion. The Center recognizes the importance of ontologies for use in a wide range of biomedical applications, and is developing new technology to make all relevant ontologies widely accessible, searchable, alignable, and useable within software systems. Ultimately, the Center will support the publication of biomedical ontologies online, much as we publish scientific knowledge in print media. The advent of biomedical knowledge that is widely available in machine-processable form will alter the way that we think about science and perform scientific experiments. The biomedical community soon will enter an era in which scientific knowledge will become more accessible, more useable, and more precise, and in which new methods will be needed to support a radically different kind of scientific publishing.
https://doi.org/10.1142/9781860947575_0003
No abstract received.
https://doi.org/10.1142/9781860947575_0004
No abstract received.
https://doi.org/10.1142/9781860947575_0005
No abstract received.
https://doi.org/10.1142/9781860947575_0006
No abstract received.
https://doi.org/10.1142/9781860947575_0007
Recent advances in biological imaging technologies have enabled the observation of living cells at high resolution over extended periods of time and are impacting biological research in areas as different as high-throughput image-based drug screening, cellular therapies, cell and developmental biology, and gene expression studies. Deciphering the complex machinery of cell function and dysfunction indeed necessitates large-scale multidimensional image-based assays to cover the wide range of highly variable and intricate properties of biological systems. However, understanding the wealth of data generated by multidimensional microscopy depends critically on decoding the visual information contained therein and on the availability of tools to do so. Innovative automatic techniques to extract quantitative data from image sequences are therefore of major interest. I will present methods we have recently developed to perform the computational analysis of image sequences from multidimensional microscopy, with particular emphasis on tracking and motion analysis for 3D+t image sequences using active contours and multiple particle tracking.
https://doi.org/10.1142/9781860947575_0008
No abstract received.
https://doi.org/10.1142/9781860947575_0009
No abstract received.
https://doi.org/10.1142/9781860947575_0010
Despite recent developments in protein structure prediction, an accurate new fold prediction algorithm remains elusive. One of the challenges facing current techniques is the size and complexity of the space containing possible structures for a query sequence. Traditionally, fragment assembly approaches to new fold prediction have explored this space with stochastic optimization techniques. Here we examine deterministic algorithms for optimizing scoring functions in protein structure prediction. Two previously unused techniques are applied to the problem: the Greedy algorithm and the Hill-climbing algorithm. The main difference between the two is that the latter implements a technique to overcome local minima. Experiments on a diverse set of 276 proteins show that the Hill-climbing algorithm consistently outperforms existing approaches based on Simulated Annealing optimization (a traditional stochastic technique) in optimizing the root mean squared deviation (RMSD) between native and working structures.
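To make the distinction concrete, here is a minimal, generic sketch of greedy descent versus hill-climbing with an escape rule. The move set, scoring function, and escape rule (a bounded uphill walk over unvisited states) are illustrative assumptions on a toy landscape, not the authors' fragment-assembly implementation.

```python
def greedy(x0, score, neighbors, max_iters=1000):
    """Pure greedy descent: accept only improving moves, stop at a local minimum."""
    x, best = x0, score(x0)
    for _ in range(max_iters):
        s, n = min((score(n), n) for n in neighbors(x))
        if s >= best:                 # no improving neighbor: local minimum reached
            break
        x, best = n, s
    return x, best

def hill_climb(x0, score, neighbors, max_iters=1000, patience=25):
    """Greedy descent plus a simple escape rule: when no neighbor improves,
    take the least-bad unvisited neighbor (a bounded uphill walk)."""
    x, s = x0, score(x0)
    best_x, best_s = x, s
    visited = {x}
    uphill = 0
    for _ in range(max_iters):
        candidates = [(score(n), n) for n in neighbors(x) if n not in visited]
        if not candidates:
            break
        ns, nx = min(candidates)
        uphill = 0 if ns < best_s else uphill + 1   # reset only when a new best is found
        if uphill > patience:
            break
        x, s = nx, ns
        visited.add(x)
        if s < best_s:
            best_x, best_s = x, s
    return best_x, best_s

# Toy usage: minimize a rugged one-dimensional "energy" over integer states.
score = lambda x: (x - 37) ** 2 + 40 * (x % 7 == 0)   # quadratic bowl with periodic bumps
neighbors = lambda x: [x - 1, x + 1]
print(greedy(0, score, neighbors))       # halts at the first local minimum
print(hill_climb(0, score, neighbors))   # walks through it and reaches x = 37
```

In this toy run, greedy descent stops at the first local minimum it meets, while the hill-climbing variant steps through it and reaches the global minimum, mirroring the behavioral difference the abstract describes.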
https://doi.org/10.1142/9781860947575_0011
Motivation: A key class of membrane proteins contains one or more transmembrane (TM) helices traversing the membrane lipid bilayer. Various properties, such as the length, arrangement, and topology (orientation) of TM helices, are closely related to a protein's functions. Although a range of methods have been developed to predict TM helices and their topologies, no single method consistently outperforms the others. In addition, topology prediction has much lower accuracy than helix prediction and thus requires continued improvement.
Results: We develop a method based on support vector machines (SVM) in a hierarchical framework that predicts TM helices first, followed by their topology. By partitioning the prediction problem into two steps, specific input features can be selected and integrated in each step. We also propose a novel scoring function for topology models based on the membrane protein folding process. When benchmarked against other methods, our approach achieves the highest scores, at 86% in helix prediction (Q2) and 91% in topology prediction (TOPO) on the high-resolution data set, an improvement of 6% and 14% in the respective categories over the second best method. Furthermore, we demonstrate the ability of our method to discriminate between membrane and non-membrane proteins with greater than 99% accuracy. When tested on a small set of newly solved membrane protein structures, our method overcomes some of the difficulties in predicting TM helices by incorporating multiple biological input features.
https://doi.org/10.1142/9781860947575_0012
Protein structure prediction is one of the most important and difficult problems in computational molecular biology. Protein threading represents one of the most promising techniques for this problem. One of the critical steps in protein threading, called fold recognition, is to choose the best-fit template for the query protein with the structure to be predicted. The standard method for template selection is to rank candidates according to the z-score of the sequence-template alignment. However, the z-score calculation is time-consuming, which greatly hinders structure prediction at a genome scale. In this paper, we present a machine learning approach that treats the fold recognition problem as a regression task and uses a least-squares boosting algorithm (LS_Boost) to solve it efficiently. We test our method on Lindahl's benchmark and compare it with other methods. According to our experimental results we can draw the conclusions that: (1) Machine learning techniques offer an effective way to solve the fold recognition problem. (2) Formulating protein fold recognition as a regression rather than a classification problem leads to a more effective outcome. (3) Importantly, the LS_Boost algorithm does not require the calculation of the z-score as an input, and therefore can obtain significant computational savings over standard approaches. (4) The LS_Boost algorithm obtains superior accuracy, with less computation for both training and testing, than alternative machine learning approaches such as SVMs and neural networks, which also need not calculate the z-score. Finally, by using the LS_Boost algorithm, one can identify important features in the fold recognition protocol, something that cannot be done using a straightforward SVM approach.
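The abstract does not spell out the LS_Boost details, but stagewise least-squares boosting in general fits a weak regressor to the current residuals at each round and adds a shrunken copy of it to the ensemble. The sketch below illustrates that generic loop with decision stumps on synthetic features; the features, weak learner, and hyperparameters are assumptions for illustration only, not the paper's implementation.

```python
import numpy as np

def fit_stump(X, r):
    """Fit the best single-feature threshold stump to residuals r (squared error)."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            pred = np.where(left, r[left].mean(), r[~left].mean())
            err = float(((r - pred) ** 2).sum())
            if best is None or err < best[0]:
                best = (err, j, t, r[left].mean(), r[~left].mean())
    _, j, t, lv, rv = best
    return lambda Z, j=j, t=t, lv=lv, rv=rv: np.where(Z[:, j] <= t, lv, rv)

def ls_boost(X, y, n_rounds=50, shrinkage=0.1):
    """Stagewise least-squares boosting: each round fits a weak regressor to the
    current residuals and adds a shrunken copy of it to the ensemble."""
    pred = np.full(len(y), y.mean())
    learners = [lambda Z, c=y.mean(): np.full(len(Z), c)]
    for _ in range(n_rounds):
        h = fit_stump(X, y - pred)                 # fit the residuals
        pred = pred + shrinkage * h(X)
        learners.append(lambda Z, h=h: shrinkage * h(Z))
    return lambda Z: sum(f(Z) for f in learners)

# Toy usage: two synthetic "alignment features" predicting a quality target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.2 * rng.normal(size=200)
model = ls_boost(X, y)
print(round(float(np.corrcoef(model(X), y)[0, 1]), 3))   # training correlation of ensemble output
```

In the fold-recognition setting described above, the ensemble's real-valued output would be used to rank candidate templates for a query, in place of a z-score.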
https://doi.org/10.1142/9781860947575_0013
Success in backbone resonance sequential assignment is fundamental to protein three-dimensional structure determination via NMR spectroscopy. Such a sequential assignment can roughly be partitioned into three separate steps: grouping resonance peaks in multiple spectra into spin systems, chaining the resultant spin systems into strings, and assigning strings of spin systems to non-overlapping consecutive amino acid residues in the target protein. Dealing with these three steps separately has been the approach of many existing assignment programs; it works well on protein NMR data of close to ideal quality, but only moderately or even poorly on most real protein datasets, where noise and data degeneracy occur frequently. We propose in this work to partition the sequential assignment not into physical steps but only virtual steps, and to use their outputs to cross-validate each other. The novelty lies in the fact that ambiguities in the grouping step are resolved by finding highly confident strings in the chaining step, and ambiguities in the chaining step are resolved by examining the mappings of strings in the assignment step. In this way, all ambiguities in the sequential assignment are resolved globally and optimally. The resultant assignment program, called GASA, was compared to several recent similar developments: RIBRA, MARS, PACES, and a random graph approach. The performance comparisons demonstrate that GASA may be more promising for practical use.
https://doi.org/10.1142/9781860947575_0014
Traditional algorithms for the structure determination of native proteins by solution nuclear magnetic resonance (NMR) spectroscopy require a large number of experimental restraints. These algorithms formulate the structure determination problem as the computation of a structure or a set of similar structures that best fit the restraints. However, for both laboratory-denatured and natively-disordered proteins, the number of restraints measured by current NMR techniques is well below that required by traditional algorithms. Furthermore, there presumably exists a heterogeneous set of structures in either the denatured or disordered state. We present a data-driven algorithm capable of computing a set of structures (ensemble) directly from sparse experimental restraints. For both denatured and disordered proteins, we formulate the structure determination problem as the computation of an ensemble of structures from the restraints. In this formulation, each experimental restraint is a distribution. Compared with previous algorithms, our algorithm can extract more structural information from the experimental data. In our algorithm, all the backbone conformations consistent with the data are computed by solving a series of low-degree polynomial equations (yielding exact solutions in closed form) and by systematic search with pruning. The algorithm has been successfully applied to determine the structural ensembles of two denatured proteins, acyl-coenzyme A binding protein (ACBP) and eglin C, using real experimental NMR data.
https://doi.org/10.1142/9781860947575_0015
Root mean square deviation (RMSD) is often used to measure the difference between structures. We show mathematically that, for multiple structure alignment, the minimum RMSD (weighted at aligned positions or unweighted) for all pairs is the same as the RMSD to the average of the structures. Thus, using RMSD implies that the average is a consensus structure. We use this property to validate and improve algorithms for multiple structure alignment. In particular, we establish the properties of the average structure, and show that an iterative algorithm proposed by Sutcliffe and co-authors can find it efficiently — each iteration takes linear time and the number of iterations is small. We explore the residuals after alignment and assign weights to positions to identify aligned cores of structures. Observing this property also calls into question whether global RMSD is the right way to compare multiple protein structures, and guides the search for more local techniques.
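The equivalence stated above rests on a standard algebraic identity (notation ours, with the superposition step held fixed): for m superposed structures with n aligned positions each,

$$\sum_{1 \le a < b \le m} \sum_{i=1}^{n} \left\| x_i^{(a)} - x_i^{(b)} \right\|^2 \;=\; m \sum_{a=1}^{m} \sum_{i=1}^{n} \left\| x_i^{(a)} - \bar{x}_i \right\|^2, \qquad \bar{x}_i = \frac{1}{m} \sum_{a=1}^{m} x_i^{(a)}.$$

Minimizing the sum of squared pairwise deviations (and hence the all-pairs RMSD) is therefore the same as minimizing squared deviations from the per-position average, which is why the average acts as an implicit consensus structure.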
https://doi.org/10.1142/9781860947575_0016
This paper presents a novel methodology to analyze low-resolution (e.g., 6Å to 10Å) protein density maps, such as those obtained through electron cryomicroscopy. At such resolutions, it is often not possible to recognize the backbone chain of the protein, but it is possible to identify individual structural elements (e.g., α-helices and β-sheets). The methodology proposed in this paper performs gradient analysis to recognize volumes in the density map and to classify them. In particular, the focus is on the reliable identification of α-helices. The methodology has been implemented in a tool, called Helix Tracer, and successfully tested with simulated structures, modeled from the Protein Data Bank at 10Å resolution. The results of the study have been compared with the only other known tool with similar capabilities (Helixhunter), showing significant improvements in recognition and precision.
https://doi.org/10.1142/9781860947575_0017
Computational search of genomes for RNA secondary structure is an important approach to the annotation of non-coding RNAs. The bottleneck of the search is sequence-structure alignment, which is often computationally intensive. A plausible solution is to devise effective filters that can efficiently remove segments unlikely to contain the desired structure patterns in the genome and to apply the search only to the remaining portions. Since filters can be substructures of the RNA to be searched, the strategy for selecting which substructures to use as filters is critical to the overall search speed-up. This issue becomes more involved when the structure contains pseudoknots; approaches that can filter pseudoknots are not yet available. In this paper, a new effective filtration scheme is introduced to filter RNA pseudoknots. Based upon the authors' earlier work on a tree-decomposable graph model for RNA pseudoknots, the new scheme can automatically derive a set of filters with the overall optimal filtration ratio. Search experiments on both synthetic and biological genomes showed that, with this filtration approach, RNA structure search can be sped up 11 to 60 fold while maintaining the same search sensitivity and specificity as without filtration. In some cases, the filtration even improves the already high specificity.
https://doi.org/10.1142/9781860947575_0018
Thermodynamic RNA secondary structure prediction is an important recipe for the latest generation of functional non-coding RNA finding tools. However, the predicted energy is not strong enough by itself to distinguish a single functional non-coding RNA from other RNA. Here, we analyze how well an RNA molecule folds into a particular structural class with a restricted folding algorithm called Thermodynamic Matcher (TDM). We compare this energy value to that of randomized sequences. We construct and apply TDMs for the non-coding RNA families RNA I and hammerhead ribozyme type III and our results show that using TDMs rather than universal minimum free energy folding allows for highly significant predictions.
https://doi.org/10.1142/9781860947575_0019
Replication of time series in microarray experiments is costly. To analyze time series data with no replicates, many model-specific approaches have been proposed. However, they fail to identify the genes whose expression patterns do not fit the pre-defined models. Besides, modeling the temporal expression patterns is difficult when the dynamics of gene expression in the experiment is poorly understood. We propose a method called PEM (Partial Energy ratio for Microarray) for the analysis of time course cDNA microarray data. In the PEM method, we assume gene expression varies smoothly in the temporal domain. This assumption is comparatively weak, and hence the method is general enough to identify genes expressed in unexpected patterns. To identify the differentially expressed genes, a new statistic is developed by comparing the energies of two convoluted profiles. We further improve the statistic for microarray analysis by introducing the concept of partial energy. The PEM statistic is incorporated into the permutation-based SAM framework for significance analysis. We evaluated the PEM method with an artificial dataset and two published time course cDNA microarray datasets on yeast. The experimental results show the robustness and generality of the PEM method. It outperforms the previous versions of SAM and the spline-based EDGE approach in identifying genes of interest that are differentially expressed in diverse manners.
https://doi.org/10.1142/9781860947575_0020
In most real-life gene expression data sets, there are multiple ordinal sample classes, each categorized as either the normal or the diseased type. Traditional feature or attribute selection methods treat the multiple classes equally without paying attention to up/down regulation across the normal and diseased types, while specific gene selection methods consider the differential expression across normal and diseased but ignore the existence of multiple classes. In this paper, to improve biomarker discovery, we propose to make the best use of both aspects: the differential expression (which can be viewed as domain knowledge of gene expression data) and the multiple classes (which can be viewed as a data set characteristic). We simultaneously take both aspects into account by employing rank-one generalized matrix approximations (GMA). Our results show that considering both aspects not only improves the accuracy of classifying the samples, but also provides a visualization method to effectively analyze the gene expression data on both genes and samples. Based on the GMA mechanism, we further propose an algorithm for obtaining a compact biomarker by reducing redundancy.
https://doi.org/10.1142/9781860947575_0021
A current major focus in genomics is the large-scale collection of genotype data in populations in order to detect variations in the population. The variation data are sought in order to address fundamental and applied questions in genetics that concern the haplotypes in the population. Since almost all the collected data is in the form of genotypes, but the downstream genetics questions concern haplotypes, the standard approach to this issue has been to try to first infer haplotypes from the genotypes, and then answer the downstream questions using the inferred haplotypes. That two-stage approach has potential deficiencies, giving rise to the general question of how well one can answer the downstream questions using genotype data without first inferring haplotypes, and also giving rise to the goal of computing the range of downstream answers that would be obtained over the range of possible inferred haplotype solutions. This paper provides some tools for the study of those issues, and some partial answers. We present algorithms to solve downstream questions concerning the minimum amount of recombination needed to derive given genotypic data, without first fixing a choice of haplotypes. We apply these algorithms to the goal of finding recombination hotspots, obtaining as good results as a published method that first infers haplotypes; and to the case of estimating the minimum amount of recombination needed to derive the true haplotypes underlying the genotypic data, obtaining weaker results compared to first inferring haplotypes using the program PHASE. Hence our tools allow an initial study of the two-stage versus one-stage issue, in the context of specific downstream questions, but our experiments certainly do not fully resolve the issue.
https://doi.org/10.1142/9781860947575_0022
Given two signed multi-chromosomal genomes Π and Γ with the same gene set, the problem of sorting by translocations (SBT) is to find a shortest sequence of translocations transforming Π into Γ, where the length of the sequence is called the translocation distance between Π and Γ. In 1996, Hannenhalli gave the first formula for the translocation distance, which led to an O(n³) algorithm for SBT. In 2005, Anne Bergeron et al. revisited this problem and gave an elementary proof of the formula for the translocation distance, which leads to a new O(n³) algorithm for SBT. In this paper, we show how to extend Bergeron's algorithm for SBT to include deletions, which allows us to compare genomes containing different genes. We present an asymptotically optimal algorithm for transforming Π into Γ by translocations and deletions, providing a feasible sequence of length at most OPT+2, where OPT is the minimum number of translocations and deletions transforming Π into Γ. Furthermore, this analysis can be used to approximate the minimum number of translocations and insertions transforming one genome into another.
https://doi.org/10.1142/9781860947575_0023
The abundance of repeat elements in the maize genome complicates its assembly. Retrotransposons alone are estimated to constitute at least 50% of the genome. In this paper, we introduce a problem called retroscaffolding, a new variant of the well-known scaffolding problem of ordering and orienting a set of assembled contigs in a genome assembly project. The key feature of this new formulation is that it takes advantage of the structural characteristics and abundance of a particular type of retrotransposon, the Long Terminal Repeat (LTR) retrotransposon. This approach is not meant to supplant but rather to complement other scaffolding approaches. The advantages of retroscaffolding are twofold: (i) it allows detection of regions containing LTR retrotransposons within the unfinished portions of a genome and can therefore guide the finishing process, and (ii) it provides a mechanism to lower sequencing coverage without impacting the quality of the final assembled genic portions. Sequencing and finishing costs dominate the expenditures in whole-genome projects, and it is often desirable, in the interest of saving cost, to reduce the effort spent on repetitive regions of a genome. The retroscaffolding technique provides a viable mechanism to this effect. Results of preliminary studies on maize genomic data validate the utility of our approach. We also report on the ongoing development of an algorithmic framework to perform retroscaffolding.
https://doi.org/10.1142/9781860947575_0024
Existing HIV-1 genotyping systems require a computationally expensive phase of multiple sequence alignment, and the alignments must be of sufficiently high quality for accurate genotyping. This is particularly a challenge when the number of strains is large. Here we propose a whole genome composition distance (WGCD) to measure the evolutionary closeness between two HIV-1 whole genomic RNA sequences, and use that measure to develop an HIV-1 genotyping system. Such a WGCD-based genotyping system avoids multiple sequence alignment and does not require any prior knowledge about evolutionary rates. Experimental results showed that the system is able to correctly identify the known subtypes, sub-subtypes, and individual circulating recombinant forms.
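The abstract does not give the WGCD formula itself. As a rough illustration of how an alignment-free, whole-genome composition measure can work, the sketch below compares normalized k-mer frequency vectors with a cosine-style distance and assigns a query to the nearest reference. The k-mer length, the distance, and the reference sequences are all assumptions for this example, not the paper's definition.

```python
from collections import Counter
from itertools import product
from math import sqrt

def kmer_freq(seq, k):
    """Normalized k-mer frequency vector over a fixed RNA alphabet."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    return [counts["".join(p)] / total for p in product("ACGU", repeat=k)]

def composition_distance(seq_a, seq_b, k=4):
    """Illustrative alignment-free distance: 1 minus cosine similarity of k-mer profiles."""
    fa, fb = kmer_freq(seq_a, k), kmer_freq(seq_b, k)
    dot = sum(a * b for a, b in zip(fa, fb))
    na, nb = sqrt(sum(a * a for a in fa)), sqrt(sum(b * b for b in fb))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

# Toy usage: assign a query genome to the closest reference, no alignment needed.
references = {                                   # hypothetical reference sequences
    "subtype_A": "AUGGCUAACGGAU" * 200,
    "subtype_B": "AUGCCGUUAGCCA" * 200,
}
query = "AUGGCUAACGGAU" * 180 + "AUGCCGUUAGCCA" * 20
print(min(references, key=lambda name: composition_distance(query, references[name])))
```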
https://doi.org/10.1142/9781860947575_0025
Assuming no interference, a multi-locus genetic likelihood is implemented based on a mathematical model of the recombination process in meiosis that accounts for events up to double crossovers in the genetic interval for any specified order of genetic markers. The mathematical model is realized with a straightforward algorithm that implements the likelihood computation process. The time complexity of the straightforward algorithm is exponential in the number of genetic markers, and implementation of the model for more than 7 genetic markers is not feasible, motivating the need for a novel algorithm. A recursive linking algorithm is proposed that decomposes the pool of genetic markers into segments and renders the model implementable for a large number of genetic markers. The recursive algorithm is shown to reduce the order of time complexity from exponential to linear. The improvement in time complexity is shown theoretically by a worst-case analysis of the algorithm and supported by run time results using data on linkage group I of the fungal genome of Neurospora crassa.
https://doi.org/10.1142/9781860947575_0026
The production of large quantities of diploid genotype data has created a need for computational methods for large-scale inference of haplotypes from genotypes. One promising approach to the problem has been to infer possible phylogenies explaining the observed genotypes in terms of putative descendants of some common ancestral haplotype. The first attempts at this problem proceeded on the restrictive assumption that observed sequences could be explained by a perfect phylogeny, in which each variant locus is presumed to have mutated exactly once over the sampled population’s history. Recently, the perfect phylogeny model was relaxed and the problem of reconstructing an imperfect phylogeny (IPPH) from genotype data was considered. A polynomial time algorithm was developed for the case when a single site is allowed to mutate twice, but the general problem remained open. In this work, we solve the general IPPH problem and show for the first time that it is possible to infer optimal q-near-perfect phylogenies from diploid genotype data in polynomial time for any constant q, where q is the number of “extra” mutations required in the phylogeny beyond what would be present in a perfect phylogeny. This work has application to the haplotype phasing problem as well as to various related problems in phylogenetic inference, analysis of sequence variability in populations, and association study design. Empirical studies on human data of known phase show this method to be competitive with the leading phasing methods and provide strong support for the value of continued research into algorithms for general phylogeny construction from diploid data.
https://doi.org/10.1142/9781860947575_0027
Haplotype inference by pure parsimony (HIPP) is known to be NP-hard. Despite this, many algorithms successfully solve HIPP instances on simulated and real data. In this paper, we explore the connection between algebraic rank and the HIPP problem, to help identify easy and hard instances of the problem. The rank of the input matrix is known to be a lower bound on the size of an optimal HIPP solution. We show that this bound is almost surely tight for data generated by randomly pairing p haplotypes derived from a perfect phylogeny when the number of distinct population members is more than (for some positive ∊). Moreover, with only a constant multiple more population members, and a common mutation, we can almost surely recover an optimal set of haplotypes in polynomial time. We examine the algebraic effect of allowing recombination, and bound the effect recombination has on rank. In the process, we prove a stronger version of the standard haplotype lower bound. We also give a complete classification of the rank of a haplotype matrix derived from a galled tree. This classification identifies a set of problem instances with recombination for which the rank lower bound is also tight for the HIPP problem.
https://doi.org/10.1142/9781860947575_0028
Oligo-based expression microarrays from Affymetrix typically contain thousands of redundant probe sets that match different regions of the same gene. We used linear regression and correlation to survey redundant probe set behavior across nearly 500 quality-screened experiments from the Arabidopsis ATH1 array manufactured by Affymetrix. We found that expression values from redundant probe set pairs were often poorly correlated. Pre-filtering expression results using MAS5.0 “present-absent” calls increased the overall percentage of well-correlated probe sets. However, poor correlation was still observed for a substantial number of probe set pairs. Visual inspection of non-correlated probe sets’ target genes suggests that some may be inappropriately merged gene models and represent independently expressed, but neighboring loci. Others may reflect differential regulation of alternative 3-prime end processing. Results are on-line at http://www.transvar.org/exp_cor/analysis.
https://doi.org/10.1142/9781860947575_0029
Structure motifs are amino acid packing patterns that occur frequently within a set of protein structures. We define a labeled graph representation of protein structure in which vertices correspond to amino acid residues and edges connect pairs of residues and are labeled by (1) the Euclidean distance between the Cα atoms of the two residues and (2) a boolean indicating whether the two residues are in physical/chemical contact. Using this representation, a structure motif corresponds to a labeled clique that occurs frequently among the graphs representing the protein structures. The pairwise distance constraints on each edge in a clique serve to limit the variation in geometry among different occurrences of a structure motif. We present an efficient constrained subgraph mining algorithm to discover structure motifs in this setting. Compared with contact graph representations, the number of spurious structure motifs is greatly reduced.
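A minimal sketch of the labeled-graph construction described above, assuming Cα coordinates are available: every residue pair gets an edge labeled with a discretized Cα-Cα distance and a boolean contact flag. The distance bin width and the use of a simple distance cutoff as the contact criterion are illustrative assumptions; the paper's contact definition may differ.

```python
import math
from itertools import combinations

def build_residue_graph(ca_coords, contact_cutoff=8.0, distance_bin=1.0):
    """Vertices are residue indices; every residue pair gets an edge labeled with
    (discretized Calpha-Calpha distance, contact flag)."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    vertices = list(range(len(ca_coords)))
    edges = {}
    for i, j in combinations(vertices, 2):
        d = dist(ca_coords[i], ca_coords[j])
        edges[(i, j)] = (round(d / distance_bin) * distance_bin,   # binned distance label
                         d <= contact_cutoff)                      # crude contact criterion
    return vertices, edges

# Toy usage with made-up coordinates (in angstroms).
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 1.0, 0.0), (20.0, 5.0, 3.0)]
vertices, edges = build_residue_graph(ca)
print(edges[(0, 1)])   # close pair: small distance, in contact
print(edges[(0, 3)])   # distant pair: large distance, not in contact
```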
Using this algorithm, structure motifs were located for several SCOP families including the Eukaryotic Serine Proteases, Nuclear Binding Domains, Papain-like Cysteine Proteases, and FAD/NAD-linked Reductases. For each family, we typically obtain a handful of motifs within seconds of processing time. The occurrences of these motifs throughout the PDB were strongly associated with the original SCOP family, as measured using a hyper-geometric distribution. The motifs were found to cover functionally important sites like the catalytic triad for Serine Proteases and co-factor binding sites for Nuclear Binding Domains. The fact that many motifs are highly family-specific can be used to classify new proteins or to provide functional annotation in Structural Genomics Projects.
https://doi.org/10.1142/9781860947575_0030
The discovery of motifs in DNA sequences remains a fundamental and challenging problem in computational molecular biology and regulatory genomics, although a large number of computational methods have been proposed in the past decade. Among these methods, the Gibbs sampling strategy has shown great promise and is routinely used for finding regulatory motif elements in the promoter regions of co-expressed genes. In this paper, we present an enhancement to the Gibbs sampling method when the expression data of the concerned genes is given. A sequence weighting scheme is proposed by explicitly taking gene expression variation into account in Gibbs sampling. That is, every putative motif element is assigned a weight proportional to the fold change in the expression level of its downstream gene under a single experimental condition, and a position specific scoring matrix (PSSM) is estimated from these weighted putative motif elements. Such an estimated PSSM might represent a more accurate motif model since motif elements with dramatic fold changes in gene expression are more likely to represent true motifs. This weighted Gibbs sampling method has been implemented and successfully tested on both simulated and biological sequence data. Our experimental results demonstrate that the use of sequence weighting has a profound impact on the performance of a Gibbs motif sampling algorithm.
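A sketch of the weighting idea in isolation: each putative motif occurrence contributes to the position-specific scoring matrix in proportion to the expression fold change of its downstream gene. The pseudocount, the normalization, and the uniform background are assumptions made for this illustration, not the paper's exact estimator or its Gibbs sampling loop.

```python
import numpy as np

BASES = "ACGT"

def weighted_pssm(sites, fold_changes, pseudocount=0.5):
    """Estimate a PSSM from putative motif sites, weighting each site by the
    expression fold change of its downstream gene."""
    w = np.asarray(fold_changes, dtype=float)
    w = w / w.sum()                                   # normalize the weights
    length = len(sites[0])
    counts = np.full((length, 4), pseudocount)
    for site, wi in zip(sites, w):
        for pos, base in enumerate(site.upper()):
            counts[pos, BASES.index(base)] += wi
    probs = counts / counts.sum(axis=1, keepdims=True)
    return np.log2(probs / 0.25)                      # log-odds against a uniform background

def score(pssm, site):
    """Log-odds score of a candidate site under the PSSM."""
    return sum(pssm[pos, BASES.index(b)] for pos, b in enumerate(site.upper()))

# Toy usage: sites with large fold changes dominate the motif model.
sites = ["TGACTCA", "TGACGCA", "AGACTCA", "CCCCCCC"]
fold_changes = [8.0, 6.0, 4.0, 1.1]                   # hypothetical expression fold changes
pssm = weighted_pssm(sites, fold_changes)
print(round(score(pssm, "TGACTCA"), 2), round(score(pssm, "CCCCCCC"), 2))
```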
https://doi.org/10.1142/9781860947575_0031
Predicting novel cleavage sites for HIV-1 protease in non-viral proteins is a difficult task because of the scarcity of previous cleavage data on proteins in a native state. We introduce a three-level hierarchical classifier which combines information from experimentally verified short oligopeptides with secondary structure and solvent accessibility information from prediction servers to predict potential cleavage sites in non-viral proteins. The best second-level classifier of the hierarchy, which uses secondary structure information, is one based on logistic regression. By using this level of classification, the false positive ratio was reduced by more than half compared to the first-level classifier using only the oligopeptide cleavage information. The method can also be applied to other protease specificity problems, to combine information from oligopeptides and structure from native proteins.
https://doi.org/10.1142/9781860947575_0032
Motif discovery is a crucial part of regulatory network identification and is therefore widely studied in the literature. Motif discovery programs search for statistically significant, well-conserved and over-represented patterns in given promoter sequences. When gene expression data is available, there are three main paradigms for motif discovery: cluster-first, regression, and joint probabilistic. The success of motif discovery depends highly on the homogeneity of input sequences, regardless of the paradigm employed. In this work, we propose a methodology for obtaining homogeneous subsets from input sequences for increased motif discovery performance. It is a unification of the cluster-first and regression paradigms based on iterative cluster re-assignment. The experimental results show the effectiveness of the methodology.
https://doi.org/10.1142/9781860947575_0033
Biological processes are always carried out through large numbers of genes (and their products), and these activities are often organized into different cellular pathways: sets of genes that cooperate to carry out specific biological functions. Owing to the development of microarray technology and its ability to simultaneously measure the expression of thousands of genes, effective algorithms to reveal biologically significant pathways are possible. However, open problems such as the large amount of noise in microarrays and the fact that most biological processes are overlapping and active only under partial conditions pose great challenges to researchers. In this paper, we propose a novel approach to identify overlapping pathways by extracting partial expression profiles from coherent cliques of clusters scattered across different conditions. We first decompose gene expression data into highly overlapping segments and partition genes into clusters in each segment; then we organize all the resulting clusters into a cluster graph and search for coherent cliques of clusters; finally, we extract expression profiles from the coherent cliques and shape biological pathways as genes consistent with these profiles. We compare our algorithm with several recent models, and the experimental results justify the advantages of our approach: it robustly identifies overlapping pathways in arbitrary sets of conditions and consequently discovers more biologically significant pathways in terms of enrichment of gene functions.
https://doi.org/10.1142/9781860947575_0034
Cellular pathways are composed of multiple reactions and interactions mediated by genes. Many of these reactions are common to multiple pathways, and each reaction might be potentially mediated by multiple genes in the same genome. Existing pathway reconstruction procedures assign a gene to all pathways in which it might catalyze a reaction, leading to a many-to-many mapping of genes to pathways. However, it is unlikely that all genes that are capable of mediating a certain reaction are involved in all the pathways that contain it. Rather, it is more likely that each gene is optimized to function in specific pathway(s). Hence, existing procedures for pathway construction produce assignments that are ambiguous. Here we present a probabilistic algorithm for the assignment of genes to pathways that addresses this problem and reduces this ambiguity. Our algorithm uses expression data, database annotations and similarity data to infer the most likely assignments, and estimate the affinity of each gene with the known cellular pathways. We apply the algorithm to metabolic pathways in Yeast and compare the results to assignments that were experimentally verified.
https://doi.org/10.1142/9781860947575_0035
The genetic analysis of spatial patterns of gene expression relies on the direct visualization of the presence or absence of gene products (mRNA or protein) at a given developmental stage (time) of a developing animal. The raw data produced by these experiments include images of Drosophila embryos showing a particular gene expression pattern revealed by a gene-specific probe. The identification of genes showing spatial and temporal overlaps in their expression patterns is fundamentally important for formulating and testing gene interaction hypotheses. Comparison of expression patterns is most biologically meaningful when images from a similar time point (developmental stage range) are compared. In this paper, we propose a computational system for automatic developmental stage classification by image analysis. This classification system uses image textural properties at a sub-block level across developmental stages as distinguishing features. Gabor filters are applied to extract features from image sub-blocks. Robust implementations of Linear Discriminant Analysis (LDA) are employed to extract the most discriminant features for the classification. Experiments on a collection of 2705 expression pattern images from early stages show that the proposed system significantly outperforms previously reported results in terms of classification accuracy, demonstrating its promise for reducing the time biologists spend assigning embryo stage ranges.
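As a rough illustration of the feature-extraction step, the sketch below builds a real-valued Gabor kernel, filters an image at a few orientations, and summarizes each sub-block by the mean and standard deviation of the response; LDA would then be applied to these feature vectors. The kernel parameters, block size, orientations, and synthetic image are all assumptions, not the system's actual settings.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size=15, wavelength=6.0, theta=0.0, sigma=3.0, gamma=0.5):
    """Real-valued Gabor kernel: an oriented sinusoid under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

def block_texture_features(image, block=32, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Mean and standard deviation of each Gabor response, per image sub-block."""
    features = []
    for theta in thetas:
        response = convolve2d(image, gabor_kernel(theta=theta), mode="same", boundary="symm")
        for i in range(0, image.shape[0] - block + 1, block):
            for j in range(0, image.shape[1] - block + 1, block):
                patch = response[i:i + block, j:j + block]
                features.extend([patch.mean(), patch.std()])
    return np.array(features)

# Toy usage: a synthetic striped "image" plus noise, 128 x 128 pixels.
rng = np.random.default_rng(1)
image = np.sin(np.arange(128)[:, None] / 3.0) + 0.1 * rng.normal(size=(128, 128))
features = block_texture_features(image)
print(features.shape)   # one texture feature vector per image; LDA is trained on such vectors
```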
https://doi.org/10.1142/9781860947575_0036
Recent research efforts have made available genome-wide, high-throughput protein-protein interaction (PPI) maps for several model organisms. This has enabled the systematic analysis of PPI networks, which has become one of the primary challenges for the systems biology community. In this study, we attempt to better understand the topological structure of PPI networks by comparing them against man-made communication networks, and more specifically, the Internet.
Our comparative study is based on a comprehensive set of graph metrics. Our results exhibit an interesting dichotomy. On the one hand, both networks share several macroscopic properties such as scale-free and small-world properties. On the other hand, the two networks exhibit significant topological differences, such as the cliquishness of the highest-degree nodes. We attribute these differences to the distinct design principles and constraints that both networks are assumed to satisfy. We speculate that evolutionary constraints favoring survivability and diversification underlie the building process of PPI networks, whereas the leading force shaping the Internet topology is a decentralized optimization process geared towards efficient node communication.
https://doi.org/10.1142/9781860947575_0037
Determining the function of proteins is a problem with immense practical impact on the identification of inhibition targets and the causes of side effects. Unfortunately, experimental determination of protein function is expensive and time consuming. For this reason, algorithms for computational function prediction have been developed to focus and accelerate this effort. These algorithms are comparison techniques which identify matches of geometric and chemical similarity between motifs, representing known functional sites, and substructures of functionally uncharacterized proteins (targets). Matches of statistically significant geometric and chemical similarity can identify targets with active sites cognate to the matching motif. Unfortunately, statistically significant matches can include false positive matches to functionally unrelated proteins. We target this problem by presenting Cavity Aware Match Augmentation (CAMA), a technique which uses C-spheres to represent active clefts which must remain vacant for ligand binding. CAMA rejects matches to targets without similar binding volumes. On 18 sample motifs, we observed that introducing C-spheres eliminated 80% of false positive matches and maintained 87% of true positive matches found with identical motifs lacking C-spheres. Analyzing a range of C-sphere positions and sizes, we observed that some high-impact C-spheres eliminate more false positive matches than others. High-impact C-spheres can be detected with a geometric analysis we call Cavity Scaling, permitting us to refine our initial cavity-aware motifs to contain only high-impact C-spheres. In the absence of expert knowledge, Cavity Scaling can guide the design of cavity-aware motifs to eliminate many false positive matches.
https://doi.org/10.1142/9781860947575_0038
Prediction of the subcellular localization of proteins is important for genome annotation, protein function prediction, and drug discovery. We present a prediction method for Gram-negative bacteria that uses ten one-versus-one support vector machine (SVM) classifiers, where compartment-specific biological features are selected as input to each SVM classifier. The final prediction of localization sites is determined by integrating the results from the ten binary classifiers using a combination of majority votes and a probabilistic method. The overall accuracy reaches 91.4%, which is 1.6% better than the state-of-the-art system, in a ten-fold cross-validation evaluation on a benchmark data set. We demonstrate that feature selection guided by biological knowledge and insights in one-versus-one SVM classifiers can lead to a significant improvement in prediction performance. Our model also achieves a highly accurate 92.8% overall accuracy for proteins with dual localizations.
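For the five Gram-negative localization sites there are C(5,2) = 10 one-versus-one classifiers, and the simplest way to combine them is a majority vote over their pairwise decisions. The sketch below shows just that combination step, with stub classifiers standing in for the feature-specific SVMs; the probabilistic tie-breaking component mentioned above is omitted.

```python
from collections import Counter
from itertools import combinations

SITES = ["cytoplasm", "inner membrane", "periplasm", "outer membrane", "extracellular"]

def majority_vote(pairwise_classifiers, protein):
    """Combine one-versus-one classifiers by majority vote over their pairwise decisions."""
    votes = Counter(clf(protein) for clf in pairwise_classifiers.values())
    label, _count = votes.most_common(1)[0]
    return label, votes

# Stub pairwise classifiers standing in for the feature-specific SVMs:
# each votes for `true_site` whenever it is one of its two candidate sites.
def make_stub(a, b, true_site="periplasm"):
    return lambda _protein: true_site if true_site in (a, b) else a

classifiers = {(a, b): make_stub(a, b) for a, b in combinations(SITES, 2)}
label, votes = majority_vote(classifiers, protein="MKKTAIAIAVALAGFATVAQA")  # sequence ignored by stubs
print(label, dict(votes))
```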
https://doi.org/10.1142/9781860947575_0039
MHC (Major Histocompatibility Complex) proteins are heterodimeric integral membrane proteins. MHC molecules are divided into two subclasses, class I and class II, which differ from each other in the size of their binding pockets. Predicting the affinity of these peptides is important for vaccine design. It is also vital for understanding the roles of the immune system in various diseases. Due to the variability of the locations of the class II peptide binding cores, predicting the affinity of these peptides is difficult. In this paper, we propose a new method for predicting the affinity of MHC class II binding peptides based on their sequences. Our method classifies peptides as binding or non-binding. The prediction is based on a three-step algorithm. In the first step, we identify informative n-grams based on their frequencies in both classes. In the next step, the alphabet size is reduced. In the last step, the class of a given sequence is predicted using the informative n-grams. We have tested our method on the MHC Bench IV-b data set [13] and compared it with various other methods in the literature.
https://doi.org/10.1142/9781860947575_0040
The ratio of the number of nonsynonymous substitutions per site (Ka) over the number of synonymous substitutions per site (Ks) has often been used to detect positive selection. Investigators now commonly generate Ka/Ks ratio profiles in a sliding window to look for peaks and valleys in order to identify regions under positive selection. Here we show that the interpretation of peaks in the Ka/Ks profile as evidence for positive selection can be misleading. Genic regions with Ka/Ks > 1 in the MRG gene family, previously claimed to be under positive selection, are associated with a high frequency of polar amino acids with a high mutability. This association between an increased Ka and a high proportion of polar amino acids appears general and not limited to the MRG gene family or the sliding-window approach. For example, the sites detected to be under positive selection in the HIV1 protein-coding genes with a high posterior probability turn out to be mostly occupied by polar amino acids. These findings caution against invoking positive selection from Ka/Ks ratios and highlight the need for considering biochemical properties of the protein domains showing high Ka/Ks ratios. In short, a high Ka/Ks ratio may arise from the intrinsic properties of amino acids instead of from extrinsic positive selection.
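For reference, a sliding-window Ka/Ks profile of the kind discussed above can be computed as below, assuming per-codon substitution and site counts have already been obtained by a standard counting method (e.g., Nei-Gojobori); those counts, the window length, and the step are illustrative assumptions.

```python
def kaks_profile(ka_counts, ks_counts, ka_sites, ks_sites, window=30, step=3):
    """Sliding-window Ka/Ks: per-codon nonsynonymous/synonymous substitution and
    site counts are summed over each window before taking the ratio."""
    profile = []
    n = len(ka_counts)
    for start in range(0, n - window + 1, step):
        sl = slice(start, start + window)
        ka = sum(ka_counts[sl]) / max(sum(ka_sites[sl]), 1e-9)
        ks = sum(ks_counts[sl]) / max(sum(ks_sites[sl]), 1e-9)
        profile.append((start, ka / ks if ks > 0 else float("inf")))
    return profile

# Toy usage with made-up per-codon counts for a 60-codon alignment.
ka_counts = [0, 1, 0, 2] * 15      # nonsynonymous substitutions per codon
ks_counts = [1, 0, 1, 0] * 15      # synonymous substitutions per codon
ka_sites  = [2.5] * 60             # nonsynonymous sites per codon
ks_sites  = [0.5] * 60             # synonymous sites per codon
for start, ratio in kaks_profile(ka_counts, ks_counts, ka_sites, ks_sites):
    print(start, round(ratio, 2))
```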
https://doi.org/10.1142/9781860947575_0041
Accurate prediction of protein function and interactions from diverse genomic data is a key problem in systems biology. Heterogeneous data integration remains a challenge, particularly due to noisy data sources, diversity of coverage, and functional biases. It is thus important to understand the behavior and robustness of data integration methods in the context of various biological functions. We focus on the ability of Bayesian networks to predict functional relationships between proteins under a variety of conditions. This study considers the effect of network structure and compares expert estimated conditional probabilities with those learned using a generative method (expectation maximization) and a discriminative method (extended logistic regression). We consider the contributions of individual data sources and interpret these results both globally and in the context of specific biological processes. We find that it is critical to consider variation across biological functions; even when global performance is strong, some categories are consistently predicted well, and others are difficult to analyze. All learned models outperform the equivalent expert estimated models, although this effect diminishes as the amount of available data decreases. These learning techniques are not specific to Bayesian networks, and thus our conclusions should generalize to other methods for data integration. Overall, Bayesian learning provides a consistent benefit in data integration, but its performance and the impact of heterogeneous data sources must be interpreted from the perspective of individual functional categories.
https://doi.org/10.1142/9781860947575_0042
In protein identification through MS/MS spectra, it is critical to accurately predict the theoretical spectrum from a peptide sequence, which depends heavily on a quantitative understanding of the fragmentation process. To date, widely used database searching methods have adopted a simple statistical model to predict the theoretical spectrum, yielding for some peptides a spectrum that deviates significantly from the experimental spectrum and therefore preventing automated positive identification. Here, in order to derive an improved prediction model, we propose a novel method to automatically learn the factors influencing fragmentation from a training set of MS/MS spectra. In this method, the determination of factors is converted into an optimization problem that minimizes an objective function measuring the distance between the experimental spectrum and the theoretical one. An iterative algorithm is then proposed to minimize the non-linear objective function. We implemented the methods and tested them on experimental data. The examination of 1451 spectra is in good agreement with established knowledge about peptide fragmentation, such as the tendency of cleavage towards the middle of the peptide and proline's preference for N-terminal cleavage. Moreover, on a testing set containing 1425 spectra, comparison between predicted and experimental spectra yields a median correlation of 0.759, showing the method's ability to predict a "realistic" spectrum. The results in this paper contribute to accurate protein identification through both database searching and de novo methods.
https://doi.org/10.1142/9781860947575_0043
Tandem mass spectrometry (MS/MS) has become a standard way of identifying peptides and proteins. A scoring function plays an important role in MS/MS data analysis. De novo sequencing is the computational step of deriving a peptide sequence from an MS/MS spectrum, normally by constructing the peptide that maximizes the scoring function. A number of polynomial-time algorithms have been developed based on scoring functions that consider only either the N-terminal or the C-terminal fragment ions of the peptide. It has remained unknown whether the problem stays polynomial-time solvable when internal fragment ions are considered. In this paper, we prove that internal fragment ions make the de novo sequencing problem NP-complete. We also propose a regression model based scoring method to incorporate correlations between the fragment ions. Our scoring function is combined with the PEAKS de novo sequencing algorithm and tested on ion trap data. The experimental results show that the regression model based scoring method can remarkably improve de novo sequencing accuracy.
https://doi.org/10.1142/9781860947575_0044
Recent studies of gene expression in cancerous tumors have revealed that cancers presenting indistinguishable symptoms in the clinic can represent substantially different entities at the molecular level. The ability to distinguish between these different cancers makes possible more accurate prognoses and more finely targeted therapeutics. Making full use of this knowledge, however, requires characterizing commonly occurring cancer sub-types and the specific molecular abnormalities that produce them. Computational approaches to this problem to date have been hindered by the fact that tumors are highly heterogeneous masses typically containing cells at multiple stages of progression from healthy to aggressively malignant. We present a computational approach for taking advantage of tumor heterogeneity when characterizing tumor progression pathways by inferring those pathways from single-cell assays. Our approach uses phylogenetic algorithms to infer likely evolutionary sequences producing cell populations in single tumors, which are in turn used to create a profile of commonly used pathways across the patient population. This approach is combined with expectation maximization to infer unknown parameters used in the phylogeny construction. We demonstrate the approach on a set of fluorescent in situ hybridization (FISH) data measuring cell-by-cell gene and chromosome copy numbers in a large sample of breast cancers. The results validate the proposed computational methods by showing consistency with several previous findings on these cancers. They also provide novel insights into the mechanisms of tumor progression in these patients.
https://doi.org/10.1142/9781860947575_0045
In vitro studies of epithelial cell morphogenesis have demonstrated the influence of environment composition and orientation in the development of multicellular epithelial structures such as tubules and cysts. We have constructed a low resolution, discrete event simulation model and report on its use to explore how experimentally observed morphogenetic phenomena under four growth conditions might be generated and controlled. We identified simulation attributes that may have in vitro counterparts. We studied how changes in the logic governing simulated epithelial cell behavior might cause abnormal growth. Simulation results support the importance of a polarized response to the environment to the generation of a normal epithelial phenotype and show how disruptions of tight mechanistic control lead to aberrant growth characteristics.
https://doi.org/10.1142/9781860947575_0046
Many biological databases contain a large number of variables, among which events of interest may be very infrequent. Using a single data mining method to analyze such databases may not find adequate predictors. The HIV Drug Resistance Database at Stanford University stores sequential HIV-1 genotype-test results on patients taking antiretroviral drugs. We have analyzed the infrequent event of gene mutation changes by combining three data mining methods. We first use association rule analysis to scan through the database and identify potentially interesting mutation patterns with relatively high frequency. Next, we use logistic regression and classification trees to further investigate these patterns by analyzing the relationship between treatment history and mutation changes. Although the AUC measures of the overall prediction are not very high, our approach can effectively identify strong predictors of mutation change and thus focus the analytic efforts of researchers on verifying these results.
https://doi.org/10.1142/9781860947575_0047
In ovarian cancer treatment, the chemotherapy drug cisplatin often induces drug resistance after prolonged use, causing cancer relapse and the eventual death of patients. Cisplatin-induced drug resistance is known to involve a complex set of cellular changes, but its molecular mechanism(s) remain unclear. In this study, we designed a systems biology approach to examine global protein-level and network-level changes by comparing proteomics profiles between cisplatin-resistant and cisplatin-sensitive cell lines. First, we used an experimental proteomics method based on a label-free Liquid Chromatography / Mass Spectrometry (LC/MS) platform to obtain a list of 119 proteins that are differentially expressed in the samples. Second, we expanded these proteins into a cisplatin-resistance activated sub-network, which consists of 1230 proteins in 1111 protein interactions. An examination of network topology features reveals that the activated responses in the network are closely coupled. Third, we examined sub-network proteins using Gene Ontology categories. We found significant enrichment of proton-transporting ATPase and ATP synthase complexes in addition to protein binding proteins. Fourth, we examined sub-network protein interaction function categories using two-dimensional visualization matrixes. We found that significant cellular physiological responses arise from endogenous, abiotic, and stress-related signals, which correlates well with the known facts that internalized cisplatin causes DNA damage and induces cell stress. Fifth and finally, we developed a new visual representation structure for the display of activated sub-networks, using functional categories as network nodes and their crosstalk as network edges. This type of sub-network further shows that while cell communication and cell growth are generally important to tumor mechanisms, molecular regulation of cell differentiation and development caused by responses to genome-wide stress seems to be more relevant to the acquisition of drug resistance.
https://doi.org/10.1142/9781860947575_bmatter
AUTHOR INDEX.