High-throughput sequencing and functional genomics technologies have given us the human genome sequence as well as those of other experimentally, medically, and agriculturally important species, and have enabled large-scale genotyping and gene expression profiling of human populations. Databases containing large numbers of sequences, polymorphisms, structures, and gene expression profiles of normal and diseased tissues are being rapidly generated for human and model organisms. Bioinformatics is thus rapidly growing in importance in the annotation of genomic sequences; the understanding of the interplay among and between genes and proteins; the analysis of genetic variability of species; the identification of pharmacological targets; and the inference of evolutionary origins, mechanisms, and relationships. This proceedings volume contains an up-to-date exchange of knowledge, ideas, and solutions to conceptual and practical issues of bioinformatics by researchers, professionals, and industrial practitioners at the 5th Asia-Pacific Bioinformatics Conference held in Hong Kong in January 2007.
https://doi.org/10.1142/9781860947995_fmatter
PREFACE.
APBC2007 ORGANIZATION.
CONTENTS.
https://doi.org/10.1142/9781860947995_0001
There are three groups of extant mammals, two of which abound in Australia. Marsupials (kangaroos and their relatives) and monotremes (echidna and the fabulous platypus) have been evolving independently for most of mammalian history. The genomes of marsupial and monotreme mammals are particularly valuable because these alternative mammals fill a phylogenetic gap in the vertebrate species lined up for exhaustive genomic study. Human and mouse (~70 MY) are too close to distinguish signal, whereas mammal/bird comparisons (~310 MY) are too distant to allow alignment. Kangaroos (180 MY) and platypus (210 MY) are just right. Sequence has diverged sufficiently for stringent detection of homologies that can reveal coding regions and regulatory signals. Importantly, marsupials and monotremes share with humans many mammal-specific developmental pathways and regulatory systems such as sex determination, lactation and X chromosome inactivation.
The ARC Centre for Kangaroo Genomics is characterizing the genome of the model Australian kangaroo Macropus eugenii (the tammar wallaby), which is being sequenced by AGRF in Australia, and Baylor (funded by NIH) in the US. We are developing detailed physical and linkage maps of the genome to complement sequencing, and will prepare and array cDNAs for functional studies, especially of reproduction and development. Complete sequencing of the distantly related Brazilian short-tailed opossum Monodelphis domestica by the NIH allows us to compare distantly related marsupials. Sequencing of the genome of the platypus, Ornithorhynchus anatinus, by Washington University (funded by the NIH) is complete, and our lab is anchoring contigs to the physical map. We have isolated and completely characterized many BACs and cDNAs containing kangaroo and platypus genes of interest, and demonstrated the value of comparisons to reveal conserved genome organization and function, and new insights into the evolution of the mammalian genome, particularly sex chromosomes.
https://doi.org/10.1142/9781860947995_0002
Rapidly growing evidence suggests that complex and variable interactions between host genetic and systems factors, diet, activity and lifestyle choices, and intestinal microbes control the incidence, severity and complexity of metabolic diseases. The dramatic increase in the world-wide incidence of these diseases, including obesity, diabetes, hypertension, heart disease, and fatty liver disease, raises the need for new ways to maintain health despite inherited and environmental risks. We are pursuing a comprehensive approach based on diet-induced models of metabolic disease. During the course of these studies, new and challenging statistical, analytical and computational problems were discovered. We pioneered a new paradigm for genetic studies based on chromosome substitution strains of laboratory mice. These strains involve systematically substituting each chromosome in a host strain with the corresponding chromosome from a donor strain. A genome survey with these strains therefore involves testing a panel of individual, distinct and non-overlapping genotypes, in contrast to conventional studies of heterogeneous populations. Studies of diet-induced metabolic disease with these strains have already led to striking observations. We discovered that most traits are controlled by many genetic variants, each of which has unexpectedly large phenotypic effects and acts in a highly non-additive manner. The non-additive nature of these variants challenges conventional models of the architecture of complex traits. At every level of resolution from the entire genome to very small genetic intervals, we discovered comparable levels of genetic complexity, suggesting a fractal property of complex traits. Another remarkable property of these large-effect variants is their ability to switch complex systems between alternative phenotypic states, such as obese to lean and high to low cholesterol, suggesting that biological traits might be organized in a small number of stable states rather than varying continuously. Moreover, by studying correlations between non-genetic variation in pairs of traits (the genetic control of non-genetic variation), we discovered a new way to dissect the functional architecture of biological systems. Finally, a neglected aspect of these studies of metabolic disease involves the intestinal microbes. Early studies suggest that diet and host physiology affect the numbers and kinds of microbes, and that these microbes in turn affect host metabolism. These interactions between 'bugs, guts and fat' extend systems studies from conventional aspects of genetics and biology to population considerations of the functional interactions between hosts, diet and our microbial passengers. With these models of diet-induced metabolic disease in chromosome substitution strains, we are now positioned to find ways to tip complex systems from disease to health.
https://doi.org/10.1142/9781860947995_0003
Advances in tandem mass spectrometry (MS/MS) steadily increase the rate of generation of MS/MS spectra. As a result, the existing approaches that compare spectra against databases are already facing a bottleneck, particularly when interpreting spectra of modified peptides. We introduce a new idea that allows one to perform MS/MS database search without ever comparing a spectrum against a database. We propose to slightly change the experimental protocol to intentionally generate spectral pairs - pairs of spectra obtained from overlapping (often non-tryptic) peptides or from unmodified and modified versions of the same peptide. While seemingly redundant, spectral pairs open up computational avenues that were never explored before. Having a spectrum of a modified peptide paired with a spectrum of an unmodified peptide allows one to separate the prefix and suffix ladders, to greatly reduce the number of noise peaks, and to generate a small number of peptide reconstructions that are likely to contain the correct one. The MS/MS database search is thus reduced to extremely fast pattern matching (rather than time-consuming matching of spectra against databases). In addition to speed, our approach provides a new paradigm for identifying post-translational modifications and for shotgun protein sequencing via spectral network analysis. This is joint work with Nuno Bandeira, Karl Clauser, Ari Frank and Dekel Tsur.
https://doi.org/10.1142/9781860947995_0004
In metagenomics, the goal is to analyze the genomic content of a sample of organisms collected from a common habitat. One approach is to apply large-scale random shotgun sequencing techniques to obtain a collection of DNA reads from the sample. This data is then compared against databases of known sequences such as NCBI-nr or NCBI-nt, in an attempt to identify the taxonomic content of the sample. We introduce new software called MEGAN (Meta Genome ANalyzer) that generates species profiles from such sequencing data by assigning reads to taxa of the NCBI taxonomy using a straightforward assignment algorithm. The approach is illustrated by application to a number of datasets obtained using both sequencing-by-synthesis and Sanger sequencing technology, including metagenomic data from a mammoth bone, a portion of the Sargasso Sea data set, and several complete microbial test genomes used for validation purposes.
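The abstract calls the read-placement step a "straightforward assignment algorithm"; in MEGAN's published description this is a lowest common ancestor (LCA) rule: a read that matches several taxa is placed on the lowest taxon containing them all. A minimal sketch under that reading, with a toy taxonomy standing in for the NCBI taxonomy (the real tool also applies BLAST significance thresholds):

```python
# Toy taxonomy: child -> parent (the root's parent is None).
PARENT = {
    "root": None,
    "Bacteria": "root", "Proteobacteria": "Bacteria",
    "E.coli": "Proteobacteria", "Salmonella": "Proteobacteria",
    "Firmicutes": "Bacteria", "B.subtilis": "Firmicutes",
}

def path_to_root(taxon):
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path[::-1]                       # root first

def assign_read(hit_taxa):
    """Place a read on the lowest common ancestor of all taxa it hits:
    species-specific reads stay on leaves, ambiguous reads rise to
    higher-level taxa."""
    paths = [path_to_root(t) for t in hit_taxa]
    lca = "root"
    for level in zip(*paths):               # walk down in lockstep
        if len(set(level)) > 1:
            break
        lca = level[0]
    return lca

print(assign_read({"E.coli"}))                  # E.coli
print(assign_read({"E.coli", "Salmonella"}))    # Proteobacteria
print(assign_read({"E.coli", "B.subtilis"}))    # Bacteria
```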
https://doi.org/10.1142/9781860947995_0005
We study the problem of selecting control clones in DNA array hybridization experiments. The problem arises in the OFRG method for analyzing microbial communities. The OFRG method performs classification of rRNA gene clones using binary fingerprints created from a series of hybridization experiments, where each experiment consists of hybridizing a collection of arrayed clones with a single oligonucleotide probe. This experiment produces analog signals, one for each clone, which then need to be classified, that is, converted into binary values 1 and 0 that represent hybridization and non-hybridization events. Besides the sample clones, the array contains a number of control clones needed to calibrate the classification procedure of the hybridization signals. These control clones must be selected with care to optimize the classification process. We formulate this as a combinatorial optimization problem called Balanced Covering. We prove that the problem is NP-hard and we show some results on hardness of approximation. We propose an approximation algorithm based on randomized rounding and we show that, with high probability, it approximates the optimum well. The experimental results confirm that the algorithm finds high-quality control clones. The algorithm has been implemented and is publicly available as part of the software package called CloneTools.
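The Balanced Covering LP itself is not reproduced in the abstract, so the sketch below shows only the generic randomized-rounding template such algorithms build on: solve the LP relaxation, then keep each clone independently with probability equal to its fractional value (Chernoff-type bounds then give the "with high probability" guarantee). The fractional solution here is a made-up stand-in, not output of the paper's LP.

```python
import random

def round_lp_solution(frac_x, rng=random.Random(0)):
    """Generic randomized rounding: given a fractional LP optimum
    {clone: x in [0, 1]}, keep each clone independently with
    probability x."""
    return {clone for clone, x in frac_x.items() if rng.random() < x}

# Hypothetical fractional optimum over five candidate control clones:
frac_x = {"c1": 0.92, "c2": 0.55, "c3": 0.51, "c4": 0.08, "c5": 0.33}
print(round_lp_solution(frac_x))
```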
https://doi.org/10.1142/9781860947995_0006
We address the problem of detecting consensus motifs that occur, with subtle variations, across multiple sequences. These usually correspond to functional elements in DNA sequences, such as transcription factor binding sites or other regulatory sites. The problem in its generality has been considered difficult, and various benchmark data serve as the litmus test for different computational methods. We present a method centered around unsupervised combinatorial pattern discovery. The parameters are chosen using a careful statistical analysis of consensus motifs. This method works well on the benchmark data and is general enough to be extended to a scenario where the variation in the consensus motif includes indels (along with mutations). We also present some results on the detection of transcription factor binding sites in human DNA sequences.
Availability: The system will be made available at http://www.research.ibm.com/computationalgenomics.
https://doi.org/10.1142/9781860947995_0007
In this paper, an effective promoter detection algorithm, called PromoterExplorer, is proposed. In our approach, various features, i.e., the local distribution of pentamers, positional CpG island features and the digitized DNA sequence, are combined to build a high-dimensional input vector. A cascade AdaBoost-based learning procedure is adopted to select the most "informative" or "discriminating" features, each used to build a weak classifier. A number of weak classifiers are then combined into a strong classifier, which achieves better performance. To reduce false positives, a cascade structure is used for detection. PromoterExplorer is tested on large-scale DNA sequences from different databases, including EPD, GenBank and human chromosome 22. The proposed method consistently outperforms PromoterInspector and Dragon Promoter Finder.
https://doi.org/10.1142/9781860947995_0008
In this paper, we present a new biclustering algorithm to provide a geometrical interpretation of similar microarray gene expression profiles. Different from standard clustering analyses, biclustering methodology can perform simultaneous classification on the row and column dimensions of a data matrix. The main objective of the strategy is to reveal submatrices in which a subset of genes exhibits a consistent pattern over a subset of conditions. However, the search for such submatrices is a computationally complex task. We propose a new algorithm, based on the Hough transform in the column-pair space, to perform pattern identification. The algorithm is especially suitable for the biclustering analysis of large-scale microarray data. Our simulation studies show that the method is robust to noise and computationally efficient. Furthermore, we have applied it to a large database of gene expression profiles of multiple human organs and the resulting biclusters show clear biological meanings.
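A minimal sketch of the geometric idea in the simplest setting: in the space spanned by one pair of conditions, every gene is a point, and genes that follow a common linear pattern across the two conditions fall on a line, which a Hough accumulator detects. The grid resolutions and toy data are illustrative choices, not the paper's settings.

```python
import numpy as np

def hough_bicluster(col_x, col_y, n_theta=180, rho_res=0.05):
    """Each gene's point (x_g, y_g) votes for every line through it,
    written in normal form x*cos(t) + y*sin(t) = rho; the accumulator
    cell with the most votes is a line, i.e. a gene subset sharing a
    linear pattern over this condition pair."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    cells = {}
    for g, (x, y) in enumerate(zip(col_x, col_y)):
        for ti, t in enumerate(thetas):
            rho_bin = int(round((x * np.cos(t) + y * np.sin(t)) / rho_res))
            cells.setdefault((ti, rho_bin), []).append(g)
    return max(cells.values(), key=len)      # genes behind the top cell

# Genes 0-3 follow the same linear pattern (y = x + 1) across the two
# conditions; gene 4 does not.
x = [0.0, 1.0, 2.0, 3.0, 2.0]
y = [1.0, 2.0, 3.0, 4.0, 0.0]
print(hough_bicluster(x, y))                 # [0, 1, 2, 3]
```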
https://doi.org/10.1142/9781860947995_0009
Microarray technologies, which can measure tens of thousands of gene expression values simultaneously in a single experiment, have become a common research method for biomedical researchers. Computational tools to analyze microarray data for biological discovery are needed. In this paper, we investigate the feasibility of using Formal Concept Analysis (FCA) as a tool for microarray data analysis. The method of FCA builds a (concept) lattice from the experimental data together with additional biological information. For microarray data, each vertex of the lattice corresponds to a subset of genes that are grouped together according to their expression values and some biological information related to gene function. The lattice structure of these gene sets might reflect biological relationships in the dataset. Similarities and differences between experiments can then be investigated by comparing their corresponding lattices according to various graph measures. We apply our method to microarray data derived from influenza infected mouse lung tissue and healthy controls. Our preliminary results show the promise of our method as a tool for microarray data analysis.
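For readers unfamiliar with FCA, the sketch below enumerates the formal concepts of a tiny binary context (genes as objects, discretized expression and annotation flags as attributes) by closing every subset of objects. This brute-force closure is exponential and purely illustrative; realistic contexts need the proper lattice-construction algorithms the paper relies on.

```python
from itertools import combinations

def concepts(context):
    """All formal concepts (extent, intent) of a small binary context,
    where context maps each object to its attribute set."""
    all_attrs = set().union(*context.values())
    def intent(objs):        # attributes shared by all objects in objs
        return set.intersection(*(context[o] for o in objs)) if objs else all_attrs
    def extent(attrs):       # objects carrying all attributes in attrs
        return {o for o, a in context.items() if attrs <= a}
    found = set()
    for r in range(len(context) + 1):
        for sub in combinations(context, r):
            it = frozenset(intent(set(sub)))
            found.add((frozenset(extent(it)), it))
    return found

# Hypothetical context: genes with discretized expression / function flags.
ctx = {"g1": {"up", "immune"},
       "g2": {"up", "immune", "kinase"},
       "g3": {"down", "kinase"}}
for ex, it in sorted(concepts(ctx), key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(ex), "<->", sorted(it))
```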
https://doi.org/10.1142/9781860947995_0010
Biclustering algorithms have emerged as an important tool for the discovery of local patterns in gene expression data. For the case where the expression data correspond to time series, efficient algorithms that work with a discretized version of the expression matrix are known. However, these algorithms assume that the biclusters to be found are perfect, in the sense that each gene in the bicluster exhibits exactly the same expression pattern along the conditions that belong to it. In this work, we propose an algorithm that identifies genes with similar, but not necessarily equal, expression patterns over a subset of the conditions. The results demonstrate that this approach identifies biclusters that are biologically more significant than those discovered by other algorithms in the literature.
https://doi.org/10.1142/9781860947995_0011
One of the main applications of microarray technology is to determine the gene expression profiles of diseases and disease treatments. This is typically done by selecting a small number of genes from among thousands to tens of thousands, whose expression values are collectively used as classification profiles. This gene selection process is notoriously challenging because microarray data normally contain only a very small number of samples but range over thousands to tens of thousands of genes. Most existing gene selection methods carefully define a function to score the differential levels of gene expression under a variety of conditions, in order to identify top-ranked genes. Such single-gene scoring methods suffer because some selected genes have very similar expression patterns, so using them all in classification is largely redundant. Furthermore, these selected genes can prevent the consideration of other individually less, but collectively more, differentially expressed genes. We propose to cluster genes in terms of their class discrimination strength and to limit the number of selected genes per cluster. By combining this idea with several existing single-gene scoring methods, we show by experiments on two cancer microarray datasets that our methods identify gene subsets which collectively have significantly higher classification accuracies.
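The paper clusters genes explicitly and pairs the idea with several existing scoring functions; the sketch below is a simpler greedy rendition of the same principle, with a t-like score and a correlation threshold standing in for the clustering: walk genes from most to least discriminative and skip a gene once enough highly correlated (i.e., same-cluster) genes are already chosen.

```python
import numpy as np

def t_score(x, labels):
    """Absolute two-sample t-like statistic for one gene across samples."""
    a, b = x[labels == 0], x[labels == 1]
    pooled = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b) + 1e-12
    return abs(a.mean() - b.mean()) / np.sqrt(pooled)

def select_genes(X, labels, n_select, corr_thresh=0.9, per_cluster=1):
    """Greedy 'cluster-capped' selection: X is genes x samples."""
    scores = np.array([t_score(X[g], labels) for g in range(X.shape[0])])
    chosen = []
    for g in np.argsort(-scores):
        twins = sum(1 for h in chosen
                    if abs(np.corrcoef(X[g], X[h])[0, 1]) >= corr_thresh)
        if twins < per_cluster:
            chosen.append(int(g))
        if len(chosen) == n_select:
            break
    return chosen

rng = np.random.default_rng(0)
labels = np.array([0] * 10 + [1] * 10)
base = rng.normal(size=20)
X = np.vstack([base + labels * 2,                                   # informative
               base + labels * 2 + rng.normal(scale=0.1, size=20),  # its twin
               rng.normal(size=20)])                                # noise
print(select_genes(X, labels, n_select=2))   # e.g. [0, 2]: the twin is skipped
```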
https://doi.org/10.1142/9781860947995_0012
We present two algorithms for calculating the quartet distance between all pairs of trees in a set of binary evolutionary trees on a common set of species. The algorithms exploit common substructure among the trees to speed up the pairwise distance calculations, and thus perform significantly better on large sets of trees than carrying out independent pairwise distance calculations, as we illustrate experimentally: we see a speedup factor of around 130 in the best case.
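As a baseline for what both this and the following paper accelerate, here is the brute-force O(n^4) quartet distance for binary trees written as nested tuples: for each 4-leaf subset, the tree's induced quartet topology is read off the leaf-to-leaf path lengths via the four-point condition, and disagreements are counted. The toy trees are illustrative.

```python
from itertools import combinations

def path_lengths(tree):
    """All leaf-to-leaf path lengths (in edges) of a nested-tuple tree."""
    dist = {}
    def walk(node):                          # -> {leaf: depth below node}
        if not isinstance(node, tuple):
            return {node: 0}
        maps = [walk(child) for child in node]
        for m1, m2 in combinations(maps, 2):
            for a, da in m1.items():
                for b, db in m2.items():
                    dist.setdefault(a, {})[b] = da + db + 2
                    dist.setdefault(b, {})[a] = da + db + 2
        return {leaf: d + 1 for m in maps for leaf, d in m.items()}
    walk(tree)
    return dist

def quartet_topology(dist, a, b, c, d):
    """Four-point condition: the smallest of the three pair-sums
    identifies the split the tree induces on {a, b, c, d}."""
    sums = {"ab|cd": dist[a][b] + dist[c][d],
            "ac|bd": dist[a][c] + dist[b][d],
            "ad|bc": dist[a][d] + dist[b][c]}
    return min(sums, key=sums.get)

def quartet_distance(t1, t2, leaves):
    """Count 4-leaf subsets on which the two trees disagree."""
    d1, d2 = path_lengths(t1), path_lengths(t2)
    return sum(quartet_topology(d1, *q) != quartet_topology(d2, *q)
               for q in combinations(leaves, 4))

t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(quartet_distance(t1, t2, "ABCDE"))     # 2
```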
https://doi.org/10.1142/9781860947995_0013
We present an algorithm for calculating the quartet distance between two evolutionary trees of bounded degree on a common set of n species. The previous best algorithm has running time O(d²n²) for trees in which no node has degree greater than d. The algorithm developed herein has running time O(d⁹n log n), which makes it the first algorithm for computing the quartet distance between non-binary trees with a sub-quadratic worst-case running time.
https://doi.org/10.1142/9781860947995_0014
Extending the idea of our previous algorithm [17, 18], we developed a new sequential quartet-based phylogenetic tree construction method. This new algorithm reconstructs the phylogenetic tree iteratively by examining, at each merge step, every possible super-quartet, which is formed by four subtrees, instead of the simple quartets of our previous algorithm. Because our new algorithm evaluates super-quartet trees, each of which may consist of more than four molecular sequences, it can effectively alleviate a traditional but important problem of quartet-based methods: quartet errors. Experimental results show that our newly proposed algorithm is capable of achieving very high accuracy and solid consistency in reconstructing phylogenetic trees on different sets of synthetic DNA data under various evolutionary circumstances.
https://doi.org/10.1142/9781860947995_0015
Phylogenetic analyses often produce a large number of candidate evolutionary trees, each a hypothesis of the "true" tree. Post-processing techniques such as strict consensus trees are widely used to summarize the evolutionary relationships into a single tree. However, valuable information is lost during the summarization process. A more elementary step is to produce estimates of the topological differences that exist among all pairs of trees. We design a new randomized algorithm, called Hash-RF, that computes the all-to-all Robinson-Foulds (RF) distance, the most common distance metric for comparing two phylogenetic trees. Our approach uses a hash table to organize the bipartitions of a tree, and a universal hashing function makes our algorithm randomized. We compare the performance of our Hash-RF algorithm to PAUP*'s implementation of computing the all-to-all RF distance matrix. Our experiments focus on the algorithmic performance of comparing sets of biological trees, where the size of each tree ranged from 500 to 2,000 taxa and the collections varied from 200 to 1,000 trees. Our experimental results clearly show that our Hash-RF algorithm is up to 500 times faster than PAUP*'s approach. Thus, Hash-RF provides an efficient alternative to a single-tree summary of a collection of trees and potentially gives researchers the ability to explore their data in new and interesting ways.
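A sketch in the spirit of Hash-RF, under simplifying assumptions: trees are nested tuples, each non-trivial bipartition is stored once in a table mapping it to the trees that contain it, and shared-split counts for all pairs are read off the table. RF is taken here as the size of the symmetric difference of the split sets, and the paper's universal hashing of bipartitions is replaced by Python's built-in dictionary hashing.

```python
from itertools import combinations

def splits(tree, all_leaves):
    """Non-trivial bipartitions of an unrooted nested-tuple tree, each
    canonicalized to its lexicographically smaller side."""
    out, univ = set(), frozenset(all_leaves)
    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*map(walk, node))
        if 1 < len(below) < len(univ) - 1:   # a non-trivial internal edge
            out.add(min(below, univ - below, key=sorted))
        return below
    walk(tree)
    return out

def all_pairs_rf(trees, all_leaves):
    """All-to-all RF distances via one shared table of bipartitions."""
    tree_splits = [splits(t, all_leaves) for t in trees]
    table = {}                               # split -> ids of trees with it
    for i, s in enumerate(tree_splits):
        for sp in s:
            table.setdefault(sp, set()).add(i)
    n = len(trees)
    shared = [[0] * n for _ in range(n)]
    for members in table.values():
        for i, j in combinations(sorted(members), 2):
            shared[i][j] += 1
    return [[0 if i == j else
             len(tree_splits[i]) + len(tree_splits[j])
             - 2 * shared[min(i, j)][max(i, j)]
             for j in range(n)] for i in range(n)]

trees = [((("A", "B"), "C"), ("D", "E")),
         ((("A", "C"), "B"), ("D", "E")),
         ((("A", "B"), "D"), ("C", "E"))]
print(all_pairs_rf(trees, "ABCDE"))   # [[0, 2, 2], [2, 0, 4], [2, 4, 0]]
```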
https://doi.org/10.1142/9781860947995_0016
Matching two geometric objects in 2D and 3D spaces is a central problem in computer vision, pattern recognition and protein structure prediction. In particular, the problem of aligning two polygonal chains under translation and rotation to minimize their distance has been studied using various distance measures. It is well known that the Hausdorff distance is useful for matching two point sets, and that the Fréchet distance is a superior measure for matching two polygonal chains. The discrete Fréchet distance closely approximates the (continuous) Fréchet distance, and is a natural measure for the geometric similarity of the folded 3D structures of bio-molecules such as proteins. In this paper, we present new algorithms for matching two polygonal chains in 2D to minimize their discrete Fréchet distance under translation and rotation, and an effective heuristic for matching two polygonal chains in 3D. We also describe our empirical results on the application of the discrete Fréchet distance to protein structure-structure alignment.
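The paper's contribution is the matching under translation and rotation, which is not reproduced here; the sketch below shows only the underlying measure, the standard Eiter-Mannila dynamic program for the discrete Fréchet distance between two fixed vertex chains (the minimal "leash length" when two walkers advance monotonically over the chains' vertices).

```python
import math
from functools import lru_cache

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between polygonal chains P and Q,
    given as lists of coordinate tuples."""
    @lru_cache(maxsize=None)
    def c(i, j):                   # distance using P[:i+1] and Q[:j+1]
        d = math.dist(P[i], Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
    return c(len(P) - 1, len(Q) - 1)

# Two roughly parallel 2D chains:
P = [(0, 0), (1, 0), (2, 0), (3, 0)]
Q = [(0, 1), (1, 1), (2, 1), (3, 1)]
print(discrete_frechet(P, Q))      # 1.0
```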
https://doi.org/10.1142/9781860947995_0017
Electron cryo-microscopy (cryo-EM) is an experimental technique to determine the 3-dimensional structure of large protein complexes. Currently this technique is able to generate protein density maps at 6 to 9 Å resolution. Although secondary structures such as α-helices and β-sheets can be visualized in these maps, there is no mature approach to deduce their tertiary topology, the linear order of the secondary structures on the sequence. The problem is challenging because, given N secondary structure elements, the number of possible orders is (2^N)·N!. We have developed a method to predict the topology of the secondary structures using ab initio structure prediction. The Rosetta structure prediction algorithm was used to make purely sequence-based structure predictions for the protein. We produced 1000 of these ab initio models, and then screened the models produced by Rosetta for agreement with the helix skeleton derived from the density map. The method was benchmarked on 60 mainly α-helical proteins, finding that for about 3/4 of the proteins, the majority of the helices in the skeleton were correctly assigned by one of the top 10 suggested topologies from the method, while for about 1/3 of the proteins the best error-free topology assignment was ranked first. This approach also provides an estimate of the sequence alignment of the skeleton. For most of the true-positive assignments, the alignment was accurate to within ±2 amino acids in the sequence.
https://doi.org/10.1142/9781860947995_0018
It is known that folding a protein chain into the cubic lattice is an NP-complete problem. We consider a seemingly easier problem: given a 3D fold of a protein chain (the coordinates of its Cα atoms), find the closest lattice approximation of this fold. This problem has been studied under names such as "lattice approximation of a protein chain", "the protein chain fitting problem" and "building protein lattice models". We show that this problem is also NP-complete for the cubic lattice with side 3.8 Å and the coordinate root mean-square deviation (cRMSD) as the distance measure.
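A small illustration of the objects involved, assuming cRMSD as the measure: snapping each Cα independently to the nearest lattice point is the unconstrained optimum, and it is precisely the chain constraints (consecutive residues on adjacent lattice points, self-avoidance) ignored here that make the real fitting problem NP-complete. The coordinates are toy values.

```python
import numpy as np

def crmsd(P, Q):
    """Coordinate root-mean-square deviation between matched point sets."""
    P, Q = np.asarray(P), np.asarray(Q)
    return float(np.sqrt(((P - Q) ** 2).sum(axis=1).mean()))

# Toy C-alpha trace and its naive snap to a cubic lattice of side 3.8 A.
ca = np.array([[0.0, 0.1, 0.0],
               [3.9, 0.2, 0.1],
               [4.0, 3.7, 0.3]])
lattice = 3.8 * np.round(ca / 3.8)
print(lattice)
print(round(crmsd(ca, lattice), 3))          # 0.265
```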
https://doi.org/10.1142/9781860947995_0019
This paper proposes algorithms for inferring a chemical structure from a feature vector based on the frequencies of labeled paths and small fragments, an inference problem with potential applications to drug design. In this paper, chemical structures are modeled as trees or tree-like structures. It is shown that the inference problems for these kinds of structures can be solved in polynomial time using dynamic programming-based algorithms. Since these algorithms are not practical, a branch-and-bound type algorithm is also proposed. The results of computational experiments suggest that the algorithm can solve the inference problem in a few seconds to a few tens of seconds for moderate-size chemical compounds.
https://doi.org/10.1142/9781860947995_0020
A Single Nucleotide Polymorphism (SNP) is a small DNA variation which occurs naturally between different individuals of the same species. Some combinations of SNPs in the human genome are known to increase the risk of certain complex genetic diseases. This paper formulates the problem of identifying such disease-associated SNP motifs as a combinatorial optimization problem and shows it to be NP-hard. Both exact and heuristic approaches for this problem are developed and tested on simulated data and real clinical data. Computational results are given to demonstrate that these approaches are sufficiently effective to support ongoing biological research.
https://doi.org/10.1142/9781860947995_0021
We study in detail a particular statistical method in genetic case-control analysis, labeled "genotype-based association", in which the two test results obtained by assuming a dominant and a recessive model are combined into one optimal output. This method differs both from the allele-based association, which artificially doubles the sample size, and from the direct χ² test on the 3-by-2 contingency table, which may overestimate the degrees of freedom. We conclude that the comparative advantage (or disadvantage) of the genotype-based test over the allele-based test mainly depends on two parameters, the allele frequency difference δ and the Hardy-Weinberg disequilibrium coefficient difference δ_ε. Six different situations, called "phases", characterized by the two χ² test statistics of the allele-based and genotype-based tests, are well separated in the phase diagram parameterized by δ and δ_ε. For two major groups of phases, a single parameter θ = tan⁻¹(δ/δ_ε) achieves an almost perfect phase separation. We also apply the analytic results to several types of disease models. It is shown that for dominant and additive models, genotype-based tests are favored over allele-based tests.
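A toy rendition of the two collapsed tests, combined here by simply taking the larger χ² value; the paper's optimal combination and the corrected null distribution of the maximum are not reproduced. The genotype counts are hypothetical.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

def genotype_based_chi2(case, ctrl):
    """case/ctrl = (n_AA, n_Aa, n_aa). Collapse the 3-by-2 genotype
    table under a dominant and a recessive model; keep the larger
    chi-square of the two resulting 2x2 tests."""
    dom = chi2_2x2(case[0] + case[1], case[2], ctrl[0] + ctrl[1], ctrl[2])
    rec = chi2_2x2(case[0], case[1] + case[2], ctrl[0], ctrl[1] + ctrl[2])
    return max(dom, rec)

print(round(genotype_based_chi2((30, 50, 20), (15, 50, 35)), 2))   # 6.45
```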
https://doi.org/10.1142/9781860947995_0022
Complex systems of interactions govern the structure and function of biomolecules. Mutations that substantially disrupt these interactions are deleterious and should not persist under selection. Yet, several instances have been reported where a variant confirmed as pathogenic in one species is fixed in the orthologs of other species. Here we introduce a novel method for detecting compensatory substitutions for these so-called compensated pathogenic deviations (CPDs), incorporating knowledge of pathogenic variants into a probabilistic method for detecting correlated evolution. The success of this approach is demonstrated for 26 of 31 CPDs observed in mitochondrial transfer RNAs and for one in beta hemoglobin. The detection of multiple compensatory sites is demonstrated for two of these CPDs. The methodology is applicable to comparative sequence data for biomolecules expressed in any alphabet, real or abstract. It provides a widely applicable approach to the prediction of compensatory substitutions for CPDs, avoiding any reliance on rigid non-probabilistic criteria or structural data. The detection of compensatory substitutions that facilitate the substitution of otherwise pathogenic variants offers valuable insight into the molecular constraints imposed on adaptive evolution.
https://doi.org/10.1142/9781860947995_0023
Genomes evolve with both mutations and large-scale events, such as inversions, translocations, duplications and losses, that modify the structure of a set of chromosomes. In order to study these types of large-scale events, the first task is to select, in different genomes, sub-sequences that are considered "equivalent". Many approaches have been used to identify equivalent sequences, based on biological experiments, gene annotations, or sequence alignments. These techniques suffer from a variety of drawbacks that often make it impossible for independent researchers to reproduce the datasets used in the studies, or to adapt them to newly sequenced genomes. In this paper, we show that carefully selected small probes can be efficiently used to construct datasets. Once a set of probes is identified – and published – datasets for whole genome comparisons can be produced, and reproduced, with elementary algorithms; decisions about what is considered an occurrence of a probe in a genome can be criticized and re-evaluated; and the structure of a newly sequenced genome can be obtained rapidly, without the need for gene annotations or intensive computations.
https://doi.org/10.1142/9781860947995_0024
Gene clusters that span three or more chromosomal regions are of increasing importance, yet statistical tests to validate such clusters are in their infancy. Current approaches either conduct several pairwise comparisons, or consider only the number of genes that occur in all the regions. In this paper, we provide statistical tests for clusters spanning exactly three regions based on genome models of typical comparative genomics problems, including analysis of conserved linkage within multiple species and identification of large-scale duplications. Our tests are the first to combine evidence from genes shared among all three regions and genes shared between pairs of regions. We show that our tests of clusters spanning three regions are more sensitive than existing approaches and can thus be used to identify more diverged homologous regions.
https://doi.org/10.1142/9781860947995_0025
In this paper, we study the exact probability distribution of the number of cycles c in the breakpoint graph of two random genomes with n genes or markers and χ1 and χ2 linear chromosomes, respectively. The genomic distance d between the two genomes is d = n − c. In the limit, we derive an explicit expression for the expectation of d.
https://doi.org/10.1142/9781860947995_0026
The total order of the genes or markers on a chromosome is crucial for most comparative genomics studies. However, the current gene mapping efforts might only suffice to provide a partial order of the genes on a chromosome. Several different genes or markers might be mapped at the same position due to the low resolution of gene mapping or missing data. Moreover, conflicting datasets might give rise to the ambiguity of gene order. In this paper, we consider the reversal distance and breakpoint distance problems for partially ordered genomes. We first prove that these problems are NP-hard, and then give an efficient heuristic algorithm to compute the breakpoint distance between partially ordered genomes. The algorithm is based on an efficient approximation algorithm for a natural generalization of the well-known feedback vertex set problem, and has been tested on both simulated and real biological datasets. The experimental results demonstrate that our algorithm is quite effective for estimating the breakpoint distance between partially ordered genomes and for inferring the gene (total) order.
https://doi.org/10.1142/9781860947995_0027
The ability to measure the transcriptional response after a stimulus has drawn much attention to the underlying gene regulatory networks. Several machine learning related methods, such as Bayesian networks and decision trees, have been proposed to deal with this difficult problem, but a systematic comparison between different algorithms has rarely been performed. In this work, we critically evaluate the application of multiple linear regression, SVMs, decision trees and Bayesian networks to reconstruct the budding yeast cell cycle network. The performance of these methods is assessed by comparing the topology of the reconstructed models to a validation network. This validation network is defined a priori and each interaction is specified by at least one publication. We also investigate the quality of the network reconstruction when a varying amount of gene regulatory dependencies is provided a priori.
https://doi.org/10.1142/9781860947995_0028
This paper proposes a novel clustering method for analyzing biological networks. In this method, each biological network is treated as an undirected graph and edges are weighted based on similarities of nodes. Then, maximal components, which are defined based on edge connectivity, are computed and the nodes are partitioned into clusters by selecting disjoint maximal components. The proposed method was applied to clustering of protein sequences and was compared with conventional clustering methods. The obtained clusters were evaluated using P-values for GO (Gene Ontology) terms. The average P-values for the proposed method were better than those for other methods.
https://doi.org/10.1142/9781860947995_0029
Inferring the structure of gene regulatory networks from gene expression data has attracted growing interest in recent years. Several machine learning related methods, such as Bayesian networks, have been proposed to deal with this challenging problem. However, in many cases, network reconstructions based purely on gene expression data do not lead to satisfactory results when comparing the obtained topology against a validation network. Therefore, in this paper we propose an "inverse" approach: starting from a priori specified network topologies, we identify those parts of the network which are relevant for the gene expression data at hand. For this purpose, we employ linear ridge regression to predict the expression level of a given gene from its relevant regulators with high reliability. Calculated statistical significances of the resulting network topologies reveal that slight modifications of the pruned regulatory network enable a further substantial improvement.
https://doi.org/10.1142/9781860947995_0030
To identify linear signaling pathways, Scott et al. [RECOMB, 2005] recently proposed to extract paths with high interaction probabilities from protein interaction networks. They used an algorithmic technique known as color-coding to solve this NP-hard problem; their implementation is capable of finding biologically meaningful pathways of length up to 10 proteins within hours. In this work, we give various novel algorithmic improvements for color-coding, both from a worst-case perspective as well as under practical considerations. Experiments on the interaction networks of yeast and fruit fly as well as a testbed of structurally comparable random networks demonstrate a speedup of the algorithm by orders of magnitude. This allows more complex and larger structures to be identified in reasonable time; finding paths of length up to 13 proteins can even be done in seconds and thus allows for an interactive exploration and evaluation of pathway candidates.
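For readers new to color-coding, here is a compact sketch of the basic technique itself (not the paper's improved variants): in each trial the vertices are randomly k-colored, and a dynamic program over (endpoint, set-of-used-colors) states finds the cheapest colorful, and hence simple, path on k vertices. With edge weights of the form -log(interaction probability), the cheapest path is the most probable one; a fixed path is colorful with probability k!/k^k per trial, so enough trials find it with high probability.

```python
import random

def best_simple_path(graph, k, trials=300, seed=1):
    """graph: {u: {v: weight}}. Returns (cost, path) of the best simple
    path on k vertices found over all color-coding trials."""
    rng = random.Random(seed)
    best = (float("inf"), None)
    for _ in range(trials):
        color = {v: rng.randrange(k) for v in graph}
        # dp[(v, used)] = (cost, path) of the best colorful path ending
        # at v whose vertices use exactly the color set 'used'.
        dp = {(v, frozenset([color[v]])): (0.0, (v,)) for v in graph}
        for size in range(1, k):
            layer = [kv for kv in dp.items() if len(kv[0][1]) == size]
            for (v, used), (cost, path) in layer:
                for w, wt in graph[v].items():
                    if color[w] in used:
                        continue           # not colorful -> not simple
                    key = (w, used | {color[w]})
                    if cost + wt < dp.get(key, (float("inf"),))[0]:
                        dp[key] = (cost + wt, path + (w,))
        for (v, used), (cost, path) in dp.items():
            if len(used) == k and cost < best[0]:
                best = (cost, path)
    return best

# Toy interaction network, weights acting as -log(probability):
G = {"a": {"b": 1.0, "c": 4.0},
     "b": {"a": 1.0, "c": 1.0, "d": 5.0},
     "c": {"a": 4.0, "b": 1.0, "d": 1.0},
     "d": {"b": 5.0, "c": 1.0}}
print(best_simple_path(G, k=4))    # (3.0, ('a', 'b', 'c', 'd')) or reversed
```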
https://doi.org/10.1142/9781860947995_0031
This paper presents an improved algorithm for de novo sequencing of multi-charge mass spectra. Recent work based on the analysis of multi-charge mass spectra showed that taking advantage of multi-charge information can lead to higher accuracy (sensitivity and specificity) in peptide sequencing. A simple de novo algorithm, called GBST (Greedy algorithm with Best Strong Tag), was proposed and shown to produce good results for spectra with charge > 2. In this paper, we analyze some of the shortcomings of GBST. We then present a new algorithm, GST-SPC, that extends the GBST algorithm in two directions. First, we use a larger set of multi-charge strong tags and show that this improves the theoretical upper bound on performance. Second, we give an algorithm that computes a peptide sequence that is optimal with respect to shared peaks count from among all sequences derived from multi-charge strong tags. Experimental results demonstrate the improvement of GST-SPC over GBST.
https://doi.org/10.1142/9781860947995_0032
Determining glycan structures is vital for comprehending cell-matrix, cell-cell, and even intracellular biological events. Glycan structure sequencing, which determines the primary structure of a glycan from its MS/MS spectra, remains one of the most important tasks in proteomics. Analogous to peptide de novo sequencing, glycan de novo sequencing determines the structure without the aid of a known glycan database. We show in this paper that glycan de novo sequencing is NP-hard. We then provide a heuristic algorithm and develop a software program to solve the problem in practical cases. Experiments on real MS/MS data of glycopeptides demonstrate that our heuristic algorithm gives satisfactory results on practical data.
https://doi.org/10.1142/9781860947995_0033
A variety of pattern-based methods have been exploited to extract biological relations from the literature. Many of them require significant domain-specific knowledge to build the patterns by hand, or a large amount of labeled data to learn the patterns automatically. In this paper, a semi-supervised model is presented that combines both unlabeled and labeled data for the pattern learning procedure. First, a large amount of unlabeled data is used to generate a raw pattern set. This set is then refined in the evaluation phase by incorporating the domain knowledge provided by a relatively small amount of labeled data. Comparative results show that labeled data, when used in conjunction with inexpensive unlabeled data, can considerably improve the learning accuracy.
https://doi.org/10.1142/9781860947995_0034
Large-scale protein-protein interactions (PPIs) detected by yeast-two-hybrid (Y2H) systems are known to contain many false positives. The separation of credible interactions from background noise is still an unavoidable task. In the present study, we propose a relative reliability score for PPIs as an intrinsic characteristic of the global topology of PPI networks. Our score is calculated as the dominant eigenvector of an adjacency matrix and represents the steady state of the network flow. Using this reliability score as a cut-off threshold for noisy Y2H PPI data, credible interactions were extracted with performance better than or comparable to previously proposed methods that are also based on network topology. The result suggests that applying the network-flow model to PPI data is useful for extracting credible interactions from noisy experimental data.
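A minimal sketch of the score computation, assuming a plain symmetric adjacency matrix: power iteration converges to the dominant eigenvector, whose entries can be read as steady-state "network flow" through each protein. How these values are thresholded into credible versus noisy interactions follows the paper and is not reproduced here.

```python
import numpy as np

def reliability_scores(adj, iters=200, tol=1e-10):
    """Dominant eigenvector of a (non-negative, symmetric) adjacency
    matrix by power iteration."""
    v = np.ones(adj.shape[0]) / np.sqrt(adj.shape[0])
    for _ in range(iters):
        w = adj @ v
        w /= np.linalg.norm(w)
        if np.linalg.norm(w - v) < tol:
            break
        v = w
    return v

# Toy PPI network: proteins 0-2 form a triangle; protein 3 dangles off 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(reliability_scores(A).round(3))   # protein 2, the hub, scores highest
```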
https://doi.org/10.1142/9781860947995_0035
Standard search techniques for DNA repeats start by identifying seeds, that is, small matching words that may inhabit larger repeats. Recent innovations in seed structure have led to the development of spaced seeds [8] and indel seeds [9], which are more sensitive than contiguous seeds (also known as k-mers, k-tuples, l-words, etc.). Evaluating seed sensitivity requires 1) specifying a homology model which describes the types of alignments that can occur between two copies of a repeat, and 2) assigning probabilities to those alignments. Optimal seed selection is a resource-intensive activity because essentially all alternative seeds must be tested [7]. Current methods require that the model and probability parameters be specified in advance. When the parameters change, the entire calculation has to be rerun. In this paper, we show how to eliminate the need for prior parameter specification. The ideas presented follow from a simple observation: given a homology model, the alignments hit by a particular seed remain the same regardless of the probability parameters. Only the weights assigned to those alignments change. Therefore, if we know all the hits, we can easily (and quickly) find optimal seeds. We describe a highly efficient preprocessing step, which is computed just once for each seed. In this calculation, strings which represent possible alignments are not weighted by any probability parameters. Then we show several increasingly efficient methods to find the optimal seed when given specific probability parameters. Indeed, we show how to determine exactly which seeds can never be optimal under any set of probability parameters. This leads to the startling observation that, out of thousands of seeds, only a handful have any chance of being optimal. We then show how to find optimal seeds and the boundaries within probability space where they are optimal. We expect this method to greatly facilitate the study of seed space sensitivity, the construction of multiple seed sets, and the use of alternative definitions of optimality.
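A small sketch of the paper's central observation under a simple i.i.d. match/mismatch homology model (an assumption made for illustration): the set of alignments a seed hits is a purely combinatorial object that can be precomputed once; probability parameters only re-weight that fixed set afterwards.

```python
from itertools import product

def seed_hits(seed, alignment):
    """seed over {'1','*'} ('1' = must match); alignment over {'1','0'}
    ('1' = matching column). The seed hits if, at some offset, all its
    '1' positions line up with matching columns."""
    k = len(seed)
    return any(all(alignment[o + i] == "1"
                   for i, s in enumerate(seed) if s == "1")
               for o in range(len(alignment) - k + 1))

def hit_set(seed, length):
    """Parameter-free preprocessing: every alignment of a given length
    that the seed hits, computed once per seed."""
    return [a for a in map("".join, product("01", repeat=length))
            if seed_hits(seed, a)]

def sensitivity(seed, length, p):
    """Re-weight the precomputed hit set by match probability p."""
    return sum(p ** a.count("1") * (1 - p) ** a.count("0")
               for a in hit_set(seed, length))

print(sensitivity("111", 8, 0.7))     # contiguous weight-3 seed
print(sensitivity("11*1", 8, 0.7))    # spaced seed of the same weight
```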
https://doi.org/10.1142/9781860947995_0036
We describe an abstract data model of protein structures that represents the geometry of proteins using spatial data types, and present a framework for fast structural similarity search based on the matching of topology strings using bipartite graph matching. The system has been implemented on top of the Oracle 9i spatial database management system. The performance evaluation was conducted on 36 proteins from the Chew and Kedem data set and also on a subset of PDB40. Our method performs well in terms of the quality of matching while having the advantage of fast execution, computing a similarity search in polynomial time. This work thus shows that a pre-computed string representation of the topological properties between secondary structure elements, built from the spatial relationships of a spatial database management system, is practical for fast structural similarity search.
https://doi.org/10.1142/9781860947995_0037
An important tool for analyzing metabolic pathways is homology search: given a pattern network, one would like to find occurrences of similar (sub)networks within a set of host networks. In the context of metabolic pathways, Pinter et al. [Bioinformatics, 2005] recently proposed to solve this computationally hard problem by restricting it to the case where both the pattern and the host network are trees. This restriction, however, severely limits the applicability of their algorithm. Here, we propose a novel algorithm that does not restrict the topology of the host or pattern network in any way; instead, we exploit a natural property of metabolic networks that we call the "local diversity property", which allows us to obtain a very fast and simple algorithm for the alignment of metabolic pathways. Experiments on a testbed of metabolic pathways extracted from the BIOCYC database indicate that our algorithm is much faster than the restricted algorithm of Pinter et al., yet has a wider range of applicability and yields new biological insights.
https://doi.org/10.1142/9781860947995_0038
G-protein coupled receptors (GPCRs) interact with G-proteins to regulate much of the cell's response to external stimuli; abnormalities in these interactions cause numerous diseases. We developed a new method to predict, from a GPCR's residue sequence, the families of G-proteins with which it interacts. We combine both alignment and n-gram features. The former capture long-range interactions but assume that the linear ordering of conserved segments is preserved. The latter make no such assumption but cannot capture long-range interactions. By combining alignment and n-gram features, and using the entire GPCR sequence (instead of intracellular regions alone, as was done by others), our method outperformed the current state of the art in precision, recall and F1, attaining 0.753 in F1 and 0.796 in accuracy on the PTbase 2004 dataset. Moreover, analysis of our results shows that the majority of coupling-specificity information lies at the beginning of the 2nd intracellular loop and over the length of the 3rd.
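The n-gram half of the combined feature vector is easy to sketch; the alignment features and the classifier itself are not reproduced here. The sequence fragment is hypothetical.

```python
from collections import Counter

def ngram_features(seq, n=2):
    """Normalized n-gram (here dipeptide) frequencies of a residue
    sequence: the order-free features that complement the long-range,
    order-aware alignment features."""
    grams = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

# Hypothetical fragment standing in for a GPCR intracellular loop:
print(ngram_features("KRSRKLAKAA"))
```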
https://doi.org/10.1142/9781860947995_bmatter
AUTHOR INDEX.