High-throughput sequencing and functional genomics technologies have given us the human genome sequence as well as those of other experimentally, medically, and agriculturally important species, thus enabling large-scale genotyping and gene expression profiling of human populations. Databases containing large numbers of sequences, polymorphisms, structures, metabolic pathways, and gene expression profiles of normal and diseased tissues are rapidly being generated for human and model organisms. Bioinformatics is therefore gaining importance in the annotation of genomic sequences; the understanding of the interplay among and between genes and proteins; the analysis of the genetic variability of species; the identification of pharmacological targets; and the inference of evolutionary origins, mechanisms, and relationships. This proceedings volume contains an up-to-date exchange of knowledge, ideas, and solutions to conceptual and practical issues of bioinformatics by researchers, professionals, and industry practitioners at the 6th Asia-Pacific Bioinformatics Conference held in Kyoto, Japan, in January 2008.
Sample Chapter(s)
Chapter 1: Recent Progress in Phylogenetic Combinatorics (185 KB)
https://doi.org/10.1142/9781848161092_fmatter
PREFACE
APBC 2008 ORGANIZATION
CONTENTS
https://doi.org/10.1142/9781848161092_0001
No abstract received.
https://doi.org/10.1142/9781848161092_0002
KEGG (http://www.genome.jp/kegg/) is a suite of databases that integrates genomic, chemical, and systemic functional aspects of biological systems. KEGG provides a reference knowledge base for linking genomes to life through the process of PATHWAY mapping, which maps, for example, the genomic or transcriptomic content of genes onto KEGG reference pathways to infer systemic behaviors of the cell or the organism. In addition, KEGG provides a reference knowledge base for linking genomes to the environment, such as for the analysis of drug-target relationships, through the process of BRITE mapping. KEGG BRITE is an ontology database representing functional hierarchies of various biological objects, including molecules, cells, organisms, diseases, and drugs, as well as relationships among them. The KEGG resource is being expanded to meet the needs of practical applications. KEGG PATHWAY now contains 26 pathway maps for human diseases in four subcategories: neurodegenerative disorders, infectious diseases, metabolic disorders, and cancers. Although such maps will continue to be added, they will never be sufficient to represent our knowledge of the molecular mechanisms of diseases, because in many cases that knowledge is too fragmentary to represent as pathways. KEGG DISEASE is a new addition to the KEGG suite that accumulates molecular-level knowledge on diseases, represented as lists of genes, drugs, biomarkers, etc. KEGG DRUG now covers all approved drugs in the U.S. and Japan. KEGG DRUG is a structure-based database: each entry is a unique chemical structure that is linked to standard generic names and associated with efficacy and target information as well as drug classifications. Target information is presented in the context of KEGG pathways, and drug classifications are part of KEGG BRITE. The generic names are linked to trade names and subsequently to outside resources of package-insert information whenever available.
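The PATHWAY mapping idea described above amounts to projecting a gene set onto reference pathway gene sets. A minimal sketch in Python, using made-up pathway contents rather than real KEGG records:

```python
# Toy illustration of PATHWAY mapping: project a gene list onto
# reference pathways by set intersection. The pathway contents below
# are hypothetical stand-ins, not real KEGG data.
reference_pathways = {
    "hsa00010 Glycolysis": {"HK1", "PFKM", "PKM", "GAPDH"},
    "hsa04210 Apoptosis": {"CASP3", "CASP9", "TP53", "BAX"},
}

def map_genes(genes, pathways):
    """Rank pathways by how many of the input genes they contain."""
    hits = {name: sorted(members & genes) for name, members in pathways.items()}
    return sorted(((name, h) for name, h in hits.items() if h),
                  key=lambda kv: -len(kv[1]))

sample = {"HK1", "GAPDH", "TP53"}
for name, hit_genes in map_genes(sample, reference_pathways):
    print(name, hit_genes)
```

A real analysis would draw the pathway membership sets from the KEGG databases themselves; the ranking-by-overlap step is the same.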
This reflects our effort to make KEGG more useful to the general public.
https://doi.org/10.1142/9781848161092_0003
To assess the feasibility of extracting protein interactions from text, we recently organized the BioCreative II challenge (http://biocreative.sourceforge.net) in collaboration with the MINT and IntAct databases. The competition was divided into four sub-tasks: a) ranking of publications by their relevance to the experimental determination of protein interactions, b) detection of protein interaction partners in text, c) detection of key sentences describing protein interactions, and d) detection of the experimental technique used to determine the interactions. Twenty teams participated in the competition, using full text and the information on interactions, sentences, and experimental vocabularies provided by the associated databases. The results were quite promising and clearly pointed to the main challenges, setting the path for future research. Furthermore, BioCreative has channeled the collaboration of several teams toward the creation of the first text-mining meta-server (the complete set of BioCreative papers is to be published in a special issue of Genome Biology).
Regarding the extraction of information on protein interactions from genomic information, over the years my group and others have contributed to the development of a set of methods based on the concept of concerted evolution between interacting protein families. In a new effort, we have recently developed a completely new approach that uses the full power of co-evolution to integrate information from complete collections of protein families.
https://doi.org/10.1142/9781848161092_0004
We introduce a general framework for string kernels. This framework can produce various types of kernels, including a number of existing kernels, for use with support vector machines (SVMs). Within this framework, we can select informative subsequences to reduce the dimensionality of the feature space, model mutations in biological sequences, and combine the contributions of subsequences in a weighted fashion to obtain the target kernel. For practical computation, we develop a novel tree structure, coupled with a traversal algorithm, to speed up the computation. Experimental results on a benchmark SCOP data set show that the kernels produced by our framework outperform existing spectrum kernels in both efficiency and ROC50 scores.
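For context, the plain k-spectrum kernel that such frameworks generalize and are compared against can be written in a few lines; this is a generic sketch, not the authors' implementation:

```python
from collections import Counter

def spectrum(seq, k):
    """k-spectrum feature map: counts of every length-k substring."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k):
    """Inner product of the two k-spectrum feature vectors."""
    sx, sy = spectrum(x, k), spectrum(y, k)
    return sum(count * sy[s] for s, count in sx.items())

# Shared 3-mers of the two strings are ATT, TTA, TAC -> kernel value 3
print(spectrum_kernel("GATTACA", "ATTAC", 3))
```

Frameworks like the one above refine this baseline by restricting the feature space to informative subsequences, allowing mutations, and weighting each subsequence's contribution.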
https://doi.org/10.1142/9781848161092_0005
The intra-nuclear organisation of proteins is based on possibly transient interactions with morphologically defined compartments like the nucleolus. The fluidity of trafficking challenges the development of models that accurately identify compartment membership for novel proteins. A growing inventory of nucleolar proteins is here used to train a support vector machine to recognise sequence features that allow the automatic assignment of compartment membership. We explore a range of sequence kernels and find that, while some success is achieved with a profile-based local alignment kernel, the problem is ill-suited to a standard compartment-classification approach.
https://doi.org/10.1142/9781848161092_0006
In the past decade, many automated methods for predicting the subcellular localization of proteins have been proposed, utilizing a wide range of principles and learning approaches. Based on an experimental evaluation of different methods and their theoretical properties, we propose combining a well-balanced set of existing approaches into new, ensemble-based prediction methods. The experimental evaluation shows that our ensembles improve substantially over the underlying base methods.
https://doi.org/10.1142/9781848161092_0007
In this paper we propose new methods of chemical structure classification based on the integration of graph database mining from data mining and graph kernel functions from machine learning. In our method, we first identify a set of general graph patterns in chemical structure data. These patterns are then used to augment a graph kernel function that calculates the pairwise similarity between molecules. The resulting similarity matrix is used to classify chemical compounds via kernel machines such as the support vector machine (SVM). Our results indicate that the pattern-based approach to graph similarity yields performance comparable to, and sometimes exceeding, that of existing state-of-the-art approaches. In addition, the identification of highly discriminative patterns for activity classification provides evidence that our methods can make generalizations about a compound's function given its chemical structure. While we evaluated our methods on molecular structures, they are designed to operate on general graph data and hence could easily be applied to other domains in bioinformatics.
https://doi.org/10.1142/9781848161092_0008
The inverse protein folding problem is that of designing an amino acid sequence which has a prescribed native protein fold. This problem arises in drug design, where a particular structure is necessary to ensure proper protein-protein interactions. The input to the inverse protein folding problem is a shape, and the goal is to design a protein sequence with a unique native fold that closely approximates the input shape. Gupta et al. [1] introduced a design in the 2D HP model of Dill that can be used to approximate any given (2D) shape. They conjectured that the protein sequences of their design are stable, but proved stability only for an infinite class of very basic structures. The HP model divides amino acids into two groups, hydrophobic (H) and polar (P), and considers only hydrophobic interactions between neighboring H amino acids in the energy formula. Another significant force acting during protein folding is the disulfide (SS) bridge formed between two cysteine amino acids. In this paper, we enrich the HP model by adding cysteines as a third group of amino acids. A cysteine monomer acts as an H amino acid, but in addition two neighboring cysteines can form a bridge to further reduce the energy of the fold. We call our model the HPC model. We consider a subclass of the linear structures designed in Gupta et al. [1] which is rich enough to approximate (although more coarsely) any given structure. We refine these structures for the HPC model by setting approximately half of the H amino acids to cysteines. We conjecture that these structures are stable under the HPC model and prove it under an additional assumption that non-cysteine H amino acids act as cysteines, i.e., they tend to form their own bridges to reduce the energy. In the proof we make efficient use of a computational tool, 2DHPSolver, which significantly speeds up the progress in the technical part of the proof.
This is a preliminary work, and we believe that the same techniques can be used to prove this result without the artificial assumption about non-cysteine H monomers.
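The energy function of such a lattice model is easy to state concretely. The sketch below scores a 2D fold under a simplified reading of the HPC model described above; it deliberately ignores the one-bridge-per-cysteine matching constraint, so it is illustrative only:

```python
def fold_energy(seq, coords):
    """Energy of a 2D lattice fold in a simplified HPC model.

    seq    -- string over {'H', 'P', 'C'}; C behaves as H plus bridging
    coords -- list of (x, y) lattice points, one per monomer

    Each lattice contact between two non-consecutive hydrophobic
    monomers (H or C) contributes -1; a C-C contact contributes an
    extra -1 for the bridge. (The real model allows at most one
    bridge per cysteine; that matching constraint is ignored here.)
    """
    energy = 0
    for i in range(len(seq)):
        for j in range(i + 2, len(seq)):  # skip chain neighbours
            (xi, yi), (xj, yj) = coords[i], coords[j]
            if abs(xi - xj) + abs(yi - yj) == 1:  # lattice contact
                if seq[i] in "HC" and seq[j] in "HC":
                    energy -= 1
                    if seq[i] == "C" and seq[j] == "C":
                        energy -= 1
    return energy

# A 2x2 square fold of the 4-mer "HCCH": the only non-chain contact
# is between the two terminal H monomers, so the energy is -1.
print(fold_energy("HCCH", [(0, 0), (1, 0), (1, 1), (0, 1)]))
```

Tools such as 2DHPSolver reason over all folds of a sequence; the function above only evaluates one given fold.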
https://doi.org/10.1142/9781848161092_0009
Graph-theoretic properties of proteins can be used to distinguish correctly folded proteins from well-designed decoy sets. The 3D structures of proteins are represented as graphs, using two different representations: Delaunay tessellations and contact map graphs. Graph-theoretic properties of both graph types yielded high classification accuracy for protein discrimination. Fisher, linear, quadratic, neural network, and support vector classifiers were used to classify the protein structures; the best classifier accuracy was over 98%. The results show that characteristic graph-theoretic properties can be used in the detection of native folds.
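A contact map graph of the kind used here can be built directly from residue coordinates with a distance cutoff. The sketch below uses toy 2D coordinates and an arbitrary threshold, then computes one simple graph-theoretic feature (mean degree) of the sort fed to such classifiers:

```python
import math

def contact_graph(coords, threshold):
    """Contact-map graph: connect residues i, j (non-neighbours on the
    chain) whose coordinates lie within the distance threshold."""
    n = len(coords)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 2, n):
            if math.dist(coords[i], coords[j]) < threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def mean_degree(adj):
    """One simple graph-theoretic feature usable for classification."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

# Toy 2D coordinates and a made-up cutoff (real contact maps use 3D
# C-alpha coordinates and a cutoff of roughly 8 angstroms):
coords = [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0), (1.5, 1.0)]
adj = contact_graph(coords, threshold=2.0)
print(adj, mean_degree(adj))
```

In practice a whole vector of such graph properties (degree statistics, clustering coefficients, etc.) is extracted per structure and passed to the classifier.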
https://doi.org/10.1142/9781848161092_0010
To assess the physico-chemical characteristics of protein-protein interactions, protein sequences and overall structural folds have been analyzed previously. Beyond this, the discovery and examination of amino acid patterns at binding sites defined by structural proximity in 3-dimensional (3D) space are essential. In this paper, we investigate the interacting preferences of 3D pattern pairs discovered separately in transient and obligate protein complexes. These 3D pattern pairs are not necessarily sequence-consecutive, but each residue in one of two groups of amino acids from two proteins in a complex is within a certain distance threshold (in Å) of most residues in the other group. We develop an algorithm called AA-pairs, in which every pair of interacting proteins is represented as a bipartite graph, and which discovers all maximal quasi-bicliques from every bipartite graph to form our 3D pattern pairs. From the 112 and 2533 highly conserved 3D pattern pairs discovered in transient and obligate complexes respectively, we observe that Ala and Leu are the most frequently occurring amino acids in the interacting 3D patterns of transient (20.91%) and obligate (33.82%) complexes, respectively. A study of the dipeptide composition on each side of the interacting 3D pattern pairs shows that the dipeptides Ala-Ala and Ala-Leu are popular in the 3D patterns of both transient and obligate complexes. Interactions between amino acids with a large hydrophobicity difference are more frequent in transient than in obligate complexes. By contrast, in obligate complexes, interactions between hydrophobic residues account for the top five most frequently occurring amino acid pairings.
https://doi.org/10.1142/9781848161092_0011
Structural bioinformatics provides new tools for investigating protein-protein interactions at the molecular level. We present two types of structural descriptors for efficiently representing and comparing protein-protein binding sites and surface patches. The descriptors are based on distributions of distances between five types of functional atoms, thereby capturing the spatial arrangement of physico-chemical properties in 3D space. Experiments with the method are performed on two tasks: (1) detection of binding sites with known similarity from homologous proteins, and (2) scanning of the surfaces of two non-homologous proteins for similar regions.
https://doi.org/10.1142/9781848161092_0012
In this paper, we consider the problem of structurally aligning a target RNA sequence of length n with a query RNA sequence of length m whose known secondary structure may contain embedded simple pseudoknots. The best known algorithm for this problem (Dost et al. [13]) runs in O(mn^4) time with O(mn^3) space, which requires too much memory, making it infeasible for comparing ncRNAs (non-coding RNAs) of length several hundred or more. We propose a memory-efficient algorithm to solve the same problem. We reduce the space complexity to O(mn^2 + n^3) while maintaining the time complexity of Dost et al.'s algorithm. Experimental results show that our algorithm is feasible for comparing ncRNAs of length more than 500.
https://doi.org/10.1142/9781848161092_0013
Genomic sequence alignment is a powerful tool for finding common subsequence patterns shared by input sequences and for identifying evolutionary relationships between species. However, the running time and space requirements of genome alignment are often extensive. In this research, we propose a novel algorithm called Coarse-Grained AlignmenT (CGAT) to reduce the computational cost of cross-species whole-genome sequence alignment. CGAT first divides the input sequences into “blocks” of fixed length and aligns these blocks to each other. The resulting block-level alignment is then refined at the nucleotide level. This two-step procedure drastically reduces the overall time and space needed for an alignment. In this paper, we show the effectiveness of the proposed algorithm by applying it to whole genome sequences of several bacteria.
https://doi.org/10.1142/9781848161092_0014
As the sequence identity between a pair of proteins decreases, alignment strategies that are based on sequence and/or sequence profiles become progressively less effective in identifying the correct structural correspondence between residue pairs. This significantly reduces the ability of comparative modeling-based approaches to build accurate structural models. Incorporating into the alignment process predicted information about the local structure of the protein holds the promise of significantly improving the alignment quality of distant proteins. This paper studies the impact on the alignment quality of a new class of predicted local structural features that measure how well fixed-length backbone fragments centered around each residue-pair align with each other. It presents a comprehensive experimental evaluation comparing these new features against existing state-of-the-art approaches utilizing profile-based and predicted secondary-structure information. It shows that for protein pairs with low sequence similarity (less than 12% sequence identity) the new structural features alone or in conjunction with profile-based information lead to alignments that are considerably better than those obtained by previous schemes.
https://doi.org/10.1142/9781848161092_0015
Transition seeds exhibit a good tradeoff between sensitivity and specificity for homology search in both coding and non-coding regions. However, identifying good transition seeds is extremely hard. We study the hit probability of high-order seed patterns. Based on our theoretical results, we propose an efficient method for ranking transition seeds for seed design.
https://doi.org/10.1142/9781848161092_0016
Spaced seeds are a filtering method invented to efficiently identify regions of interest in similarity searches. It is now well known that certain spaced seeds hit (detect) a randomly sampled similarity region with higher probability than others. Assume each position of the similarity region is an identity with probability p, independently. The seed optimization problem seeks the seed with the highest hit probability for a given length and weight. Although the problem was previously shown not to be NP-hard, in practice it seems difficult to solve: the only algorithm known to compute the optimal seed is still exhaustive search in exponential time. In this article we offer some insight into the hardness of the seed design problem by demonstrating a relation between the seed optimization problem and the optimal Golomb ruler design problem, a well-known difficult problem in combinatorial design.
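The hit probability being optimized can be computed exactly by brute force for tiny regions, which makes the objective concrete even though practical seed evaluation uses dynamic programming. A sketch under the i.i.d. match model described above:

```python
from itertools import product

def hit_probability(seed, region_len, p):
    """Exact hit probability of a seed ('1' = required match, '0' =
    don't care) on an i.i.d. similarity region: each position matches
    with probability p, and the seed hits if some offset aligns all of
    its '1' positions with matches. Enumerates all 2^region_len match
    patterns, so this is feasible only for very small regions."""
    ones = [i for i, c in enumerate(seed) if c == "1"]
    total = 0.0
    for pattern in product((0, 1), repeat=region_len):
        if any(all(pattern[off + i] for i in ones)
               for off in range(region_len - len(seed) + 1)):
            m = sum(pattern)
            total += p ** m * (1 - p) ** (region_len - m)
    return total

print(round(hit_probability("1101", 8, 0.7), 4))
```

Ranking candidate seeds of the same length and weight by this quantity is exactly the seed optimization problem; the exponential cost of doing it naively is what motivates the hardness discussion above.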
https://doi.org/10.1142/9781848161092_0017
Many efforts at standardising terminologies within the biological domain have resulted in the construction of hierarchical controlled vocabularies that capture domain knowledge. Vocabularies such as the PSI-MI vocabulary capture both deep and extensive domain knowledge in the OBO (Open Biomedical Ontologies) format. However, hierarchical vocabularies represented in OBO, such as PSI-MI, capture only simple parent-child relationships between terms. By contrast, ontologies constructed using the Web Ontology Language (OWL), such as BioPAX, define many richer types of relationships between terms. OWL provides a semantically rich, structured language for describing classes and sub-classes of entities and properties, the relationships between them, and domain-specific rules or axioms that can be applied to extract new information through semantic inference. To fully exploit the domain knowledge inherent in domain-specific controlled vocabularies, they need to be represented as OWL-DL ontologies rather than in formats such as OBO. In this paper, we describe a method for converting OBO vocabularies into OWL, with class instances represented as OWL-RDF triples. This approach preserves the hierarchical arrangement of the domain knowledge whilst also making the underlying parent-child relationships available to inference engines. It also has clear advantages over existing methods, which incorporate terms from external controlled vocabularies as literals stripped of the context associated with their place in the hierarchy. By preserving this context, we enable machine inference over the ordered domain knowledge captured in OBO controlled vocabularies.
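The OBO-to-triples step can be illustrated with a tiny parser. This is a generic sketch covering only a minimal subset of [Term] stanza tags, not the authors' converter, and the sample stanzas are used purely as an example:

```python
def obo_to_triples(obo_text):
    """Convert [Term] stanzas of a (toy) OBO fragment into subject-
    predicate-object triples, preserving is_a parent links -- the kind
    of structure an OWL/RDF representation retains for inference."""
    triples, term = [], {}
    for line in obo_text.splitlines() + ["[Term]"]:  # sentinel flush
        line = line.strip()
        if line == "[Term]":
            if "id" in term:
                triples.append((term["id"], "rdfs:label", term.get("name", "")))
                for parent in term.get("is_a", []):
                    triples.append((term["id"], "rdfs:subClassOf", parent))
            term = {}
        elif ":" in line:
            key, _, value = line.partition(":")
            value = value.split("!")[0].strip()  # drop trailing comment
            if key == "is_a":
                term.setdefault("is_a", []).append(value)
            else:
                term[key.strip()] = value
    return triples

sample = """\
[Term]
id: MI:0001
name: interaction detection method
[Term]
id: MI:0045
name: experimental interaction detection
is_a: MI:0001 ! interaction detection method
"""
for t in obo_to_triples(sample):
    print(t)
```

The point of the conversion is that the `rdfs:subClassOf` triples keep the parent-child context machine-readable, instead of flattening terms into bare string literals.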
https://doi.org/10.1142/9781848161092_0018
The similarity of two gene products can be used to solve many problems in computational biology. Since one gene product corresponds to several GO (Gene Ontology) terms, one way to calculate gene product similarity is to use the similarity of their GO terms, which can be defined as a semantic similarity on the GO graph. There are many definitions of the similarity of two GO terms, but they do not use the information in the GO graph efficiently. This paper presents a way to mine more of that information by treating edges as carriers of information content and by using negation on the semantic graph. In a simple experiment, the accuracy increased by 8.3 percent on average compared with the traditional method, which uses nodes as the information source.
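As a point of reference, the traditional node-based approach that such work improves on is Resnik-style similarity: the information content of the most informative common ancestor. A toy sketch with a made-up DAG and made-up term probabilities:

```python
import math

# Toy GO-like DAG (child -> parents) with hypothetical annotation
# probabilities; real IC values come from corpus annotation counts.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}
prob = {"A": 1.0, "B": 0.5, "C": 0.6, "D": 0.2, "E": 0.3}

def ancestors(term):
    """A term together with all of its ancestors in the DAG."""
    result = {term}
    for p in parents[term]:
        result |= ancestors(p)
    return result

def resnik(t1, t2):
    """Node-based semantic similarity: information content (-log p)
    of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(-math.log(prob[t]) for t in common)

print(round(resnik("D", "E"), 3))  # MICA is C, so -ln(0.6) ~ 0.511
```

Edge-based variants redistribute the information content onto the edges of the graph rather than the nodes, which is the direction the abstract above pursues.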
https://doi.org/10.1142/9781848161092_0019
We present a new direction of research that deploys text mining technologies to construct and maintain databases organized in the form of pathways, by associating parts of papers with relevant portions of a pathway and vice versa. To materialize this scenario, we present two annotated corpora. The first, Event Annotation, identifies the spans of text in which biological events are reported, while the other, Pathway Annotation, associates portions of papers with specific parts of a pathway.
https://doi.org/10.1142/9781848161092_0020
Protein sequences hold great potential for revealing protein function, structural families, and evolutionary information. Classifying protein sequences into different functional groups or families based on their sequence patterns has attracted much research effort in the last decade. A key issue in these classification systems is how to interpret and represent protein sequences, which largely determines the performance of the classifiers. Inspired by text classification and Chinese word segmentation techniques, we propose a segmentation-based feature extraction method. The extracted features include selected words, i.e., substrings of the sequences, as well as motifs specified in public databases. These are segmented out, and their occurrence frequencies are recorded as the feature vector values. We conducted experiments on two protein data sets: a set of SCOP families and the GPCR family. Experiments on the classification of SCOP protein families show that the proposed method not only yields an extremely condensed feature set but also achieves higher accuracy than methods based on the whole k-spectrum feature space. It also performs comparably to the most powerful classifiers for GPCR level I and level II subfamily recognition, with 92.6% and 88.8% accuracy, respectively.
https://doi.org/10.1142/9781848161092_0021
Many RNA functions are determined by specific secondary and tertiary structures. These structures are folded by the canonical G::C and A::U base pairings as well as by the non-canonical G::U complementary bases. G::U base pairings in RNA secondary structures may induce structural asymmetries between the transcribed and non-transcribed strands of the corresponding DNA sequences, likely because the corresponding C::A nucleotides of the complementary strand do not pair. As a consequence, the secondary structures that form from a genomic sequence depend on which strand is transcribed. We explore this idea to investigate the size and significance of both global and local secondary structure formation differentials in several non-coding RNA families and mRNAs. We show that both the thermodynamic stability of global RNA structures in the transcribed strand and RNA structure strand asymmetry are statistically stronger than in randomized versions preserving the same dinucleotide composition and length, and that the effect is especially pronounced in microRNA precursors. We further show that a measure of local structural strand asymmetry within a fixed window, as could be used to detect and characterize transcribed regions in a full genome scan, can predict the transcribed strand across ncRNA families.
https://doi.org/10.1142/9781848161092_0022
Non-coding RNAs (ncRNAs) are transcripts that do not code for proteins. Recent findings have shown that RNA-mediated regulatory mechanisms influence a substantial portion of typical microbial genomes. We present an efficient method for finding potential ncRNAs in bacteria by clustering genomic sequences based on homology inferred from both primary sequence and secondary structure. We evaluate our approach using a set of Firmicutes sequences, and the results show promise for discovering new ncRNAs.
https://doi.org/10.1142/9781848161092_0023
Clustering objects with respect to a given similarity or distance measure is a problem often encountered in computational biology. Several well-known clustering algorithms are based on transforming the input matrix into a weighted graph although the resulting WEIGHTED CLUSTER EDITING problem is computationally hard: here, we transform the input graph into a disjoint union of cliques such that the sum of weights of all modified edges is minimized.
We present fixed-parameter algorithms for this problem which guarantee to find an optimal solution in provable worst-case running time. We introduce a new data reduction operation (merging vertices) that has no counterpart in the unweighted case and strongly cuts down running times in practice. We have applied our algorithms to both artificial and biological data. Despite the complexity of the problem, our method often allows exact computation of optimal solutions in reasonable running time.
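The weighted cluster editing objective is easy to state by brute force, which makes it concrete even though the fixed-parameter algorithms above solve it far more efficiently. A sketch on a tiny made-up instance (positive weight = existing edge with deletion cost, negative weight = non-edge with insertion cost):

```python
def partitions(items):
    """All set partitions of a list (Bell-number many; tiny inputs only)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def editing_cost(weights, clustering):
    """Cost of turning the weighted graph into this disjoint union of
    cliques: pay w to delete an edge cut by the clustering, and -w to
    insert a missing edge inside a cluster."""
    cluster_of = {v: i for i, cl in enumerate(clustering) for v in cl}
    cost = 0.0
    for (u, v), w in weights.items():
        same = cluster_of[u] == cluster_of[v]
        if same and w < 0:        # insert a missing intra-cluster edge
            cost += -w
        elif not same and w > 0:  # delete an existing cut edge
            cost += w
    return cost

def optimal_clustering(vertices, weights):
    return min(partitions(list(vertices)),
               key=lambda cl: editing_cost(weights, cl))

w = {("a", "b"): 3, ("a", "c"): 2, ("b", "c"): -1, ("c", "d"): 4,
     ("a", "d"): -2, ("b", "d"): -2}
best = optimal_clustering("abcd", w)
print(sorted(sorted(cl) for cl in best), editing_cost(w, best))
```

Exhaustive enumeration of partitions explodes immediately, which is exactly why the data reduction (such as vertex merging) and fixed-parameter search described above matter in practice.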
https://doi.org/10.1142/9781848161092_0024
This paper proposes a series of methods for measuring the similarity of protein structures. In the proposed methods, an original protein structure is transformed into a distance matrix, which is regarded as a two-dimensional image. The similarity of two protein structures is then measured by a kind of compression ratio of the concatenated image. We employed several image compression algorithms (JPEG, GIF, PNG, IFS, and SPC) and audio compression algorithms (MP3 and FLAC). We applied the proposed method to the clustering of protein structures. The results of computational experiments suggest that SPC has the best performance.
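The "compression ratio of the concatenated image" idea is closely related to the normalized compression distance (NCD). A sketch with zlib standing in for the image and audio compressors used in the paper, applied to byte strings rather than distance-matrix images:

```python
import zlib

def ncd(x, y):
    """Normalized compression distance: similar objects make the
    concatenation compress almost as well as the larger part alone,
    so related pairs score lower."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

motif = b"ACDEFGHIKLMNPQRSTVWY"
seq_a = motif * 30                                  # highly regular
seq_b = motif * 28 + b"WY" * 20                     # mostly the same
seq_c = bytes(range(256)) * 3                       # unrelated content
print(ncd(seq_a, seq_b) < ncd(seq_a, seq_c))        # related pair is closer
```

Replacing zlib with an image codec applied to the concatenated distance-matrix images gives the kind of similarity measure the paper clusters with; the pairwise NCD values feed directly into any standard clustering algorithm.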
https://doi.org/10.1142/9781848161092_0025
The genome halving problem, previously solved by El-Mabrouk for inversions and reciprocal translocations, is here solved in a more general context that also allows transpositions and block interchanges, for genomes including multiple linear and circular chromosomes. We apply this to several data sets and compare the results to those of the previous algorithm.
https://doi.org/10.1142/9781848161092_0026
Rapidly increasing numbers of organisms have been completely sequenced and most of their genes identified; homologies among these genes are also becoming established. It has thus become possible to represent whole genomes as ordered lists of gene identifiers and to study the evolution of these entities through computational means, in systematics as well as in comparative genomics. While dealing with rearrangements is nontrivial, the biggest stumbling block remains gene duplications and losses, which lead to considerable difficulties in determining orthologs among gene families, all the more since orthology determination has a direct impact on the selection of rearrangements. None of the existing phylogenetic reconstruction methods that use gene orders is able to exploit the information present in complete gene families: most assume singleton families and equal gene content, limiting the evolutionary operations to rearrangements, while others enforce these assumptions by eliminating nonshared genes and selecting one exemplar from each gene family. In this work, we leverage our past work on genomic distances, on tight bounding of parsimony scores through linear programming, and on divide-and-conquer methods for large-scale reconstruction to build the first computational approach to phylogenetic reconstruction from complete gene order data, taking into account not only rearrangements, but also duplication and loss of genes. Our approach can handle multichromosomal data and gene families of arbitrary sizes, and scales up to hundreds of genomes through the use of disk-covering methods. We present experimental results on simulated unichromosomal genomes in a range of sizes consistent with prokaryotes.
Our results confirm that equalizing gene content, as done in existing phylogenetic tools, discards important phylogenetic information; in particular, our approach easily outperforms the most commonly referenced tool, MGR, often returning trees with less than one quarter of the errors found in the MGR trees.
https://doi.org/10.1142/9781848161092_0027
The SPR (subtree prune and regraft) operation is used as the basis for reconciling incongruent phylogenetic trees, particularly for detecting and analyzing non-treelike evolutionary histories such as horizontal gene transfer, hybrid speciation, and recombination. The SPR-based tree reconciliation problem has been shown to be NP-hard, and several efficient heuristics have been designed to solve it. A major drawback of these heuristics is that for the most part they do not handle non-binary trees appropriately. Further, their computational efficiency suffers significantly when computing multiple optimal reconciliations. In this paper, we present algorithmic techniques for efficient SPR-based reconciliation of trees that are not necessarily binary. Further, we present divide-and-conquer approaches that enable efficient computing of multiple optimal reconciliations. We have implemented our techniques in the PhyloNet software package, which is publicly available at http://bioinfo.cs.rice.edu. The resulting method outperforms all existing methods in terms of speed, and performs at least as well as those methods in terms of accuracy.
https://doi.org/10.1142/9781848161092_0028
In addition to the well-known edit operations, the alignment of minisatellite maps includes duplication events. We model these duplications using a special kind of spanning tree and deduce an optimal duplication scenario by computing the respective minimum spanning tree. Based on the best duplication scenarios for all substrings of the given sequences, we compute an optimal alignment of two minisatellite maps. Our algorithm improves upon previously developed algorithms in the generality of the model, in alignment quality, and in space-time efficiency. Using this algorithm, we derive evidence of a directional bias in the growth of minisatellites in the MSY1 dataset.
https://doi.org/10.1142/9781848161092_0029
We introduce a comparative analysis of metabolic reaction networks between different species. Our method systematically investigates full metabolic networks of multiple species at the same time, with the goal of identifying highly similar yet non-identical pathways which execute the same metabolic function, i.e. the transformation of a specific substrate into a certain end product via similar reactions. We present a clear framework for matching metabolic pathways, and propose a scoring scheme which combines enzyme functional similarity with protein sequence similarity. This analysis helps to gain insight in the biological differences between species and provides comprehensive information on diversity in pathways between species and alternative pathways within species, which is useful for pharmaceutical and industrial bioengineering targets. The results also generate hypotheses for improving current metabolic networks or constructing such networks for currently unannotated species.
https://doi.org/10.1142/9781848161092_0030
This paper presents a novel method for automatically recovering signaling pathways from protein-protein interaction networks. Given an undirected, weighted protein interaction network, finding signaling pathways is treated as searching for optimal subnetworks according to some cost function. To approach this optimization problem, we propose an integer linear programming model of signaling pathways in the protein interaction network. Numerical results on three known yeast MAPK signaling pathways demonstrate the efficiency and effectiveness of the proposed method.
https://doi.org/10.1142/9781848161092_0031
We present a new approach to segmenting multiple time series by analyzing the dynamics of cluster rearrangement around putative segment boundaries. By directly minimizing information-theoretic measures of segmentation quality derived from Kullback-Leibler (KL) divergences, our formulation reveals clusters of genes along with a segmentation such that clusters show concerted behavior within segments but exhibit significant regrouping across segmentation boundaries. This approach finds application in distilling large numbers of gene expression profiles into temporal relationships underlying biological processes. The results of the segmentation algorithm can be summarized as Gantt charts revealing temporal dependencies in the ordering of key biological processes. Applications to the yeast metabolic cycle and the yeast cell cycle are described.
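The idea of scoring a candidate segment boundary by how strongly clusters regroup across it can be sketched with a symmetrized Kullback-Leibler divergence between cluster-membership distributions. The cluster counts below are hypothetical, and the paper's actual objective differs in detail:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions
    (assumes q is strictly positive wherever p is)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def boundary_score(left_counts, right_counts):
    """Symmetrized KL between cluster-size distributions on either side of
    a candidate boundary: a high score means strong regrouping of genes."""
    def normalize(counts):
        total = sum(counts)
        return [c / total for c in counts]
    p, q = normalize(left_counts), normalize(right_counts)
    return kl(p, q) + kl(q, p)

# Hypothetical cluster sizes before/after two candidate boundaries.
print(boundary_score([10, 10, 10], [25, 4, 1]))   # strong regrouping
print(boundary_score([10, 10, 10], [11, 10, 9]))  # little regrouping
```

A segmentation procedure would place boundaries where this score peaks, so that genes behave coherently within segments but reshuffle between them.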
https://doi.org/10.1142/9781848161092_0032
We present algorithms for finding control strategies in Boolean Networks (BNs). Our approach uses symbolic techniques from the field of model checking. We show that, despite recent hardness results for finding control policies, a model checking-based approach is often capable of scaling to extremely large and complex models. We demonstrate the effectiveness of our approach by applying it to a BN model of embryogenesis in D. melanogaster with 15,360 Boolean variables.
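The control problem itself is easy to state on a toy network: find a sequence of external inputs that drives the network from a start state to a target state under synchronous updates. The three-node network below is invented for illustration (it is not the embryogenesis model), and the brute-force search is what symbolic model checking replaces at scale:

```python
from itertools import product

def step(state, control):
    """One synchronous update of a toy 3-node Boolean network with a
    single external control input (illustrative update rules)."""
    a, b, c = state
    return (b and not control, a or c, control and not a)

def find_control(start, target, horizon=4):
    """Brute-force search over all control sequences up to the horizon --
    the problem the paper solves symbolically for very large networks."""
    for seq in product([False, True], repeat=horizon):
        state = start
        for u in seq:
            state = step(state, u)
        if state == target:
            return seq
    return None

seq = find_control((True, False, False), (False, True, True))
print(seq)
```

Enumeration costs 2^horizon simulations, which is exactly why symbolic techniques that manipulate whole sets of states at once are needed for models with thousands of variables.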
https://doi.org/10.1142/9781848161092_0033
Estimates of population genetic parameters such as allele frequencies, heterozygosities, inbreeding coefficients, and genetic distances rely on the assumption that all sampled genotypes come from a randomly interbreeding population or sub-population. Here we show that small cross-generational samples may severely bias estimates of allele frequencies when a small number of progenies dominates the next generation or the sample. A new estimator of allele frequencies is developed for cases in which the kin structure of the focal sample is unknown and has to be assessed simultaneously. Using Monte Carlo simulations, we demonstrate that the new estimator delivers a significant improvement over the conventional allele-counting estimator.
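The conventional allele-counting estimator that the paper's method improves upon is simply the observed allele count over the 2N gene copies in a diploid sample. A minimal sketch on a hypothetical biallelic locus:

```python
def allele_frequency(genotypes, allele):
    """Conventional allele-counting estimator: copies of the allele over
    2N gene copies in a diploid sample. It is unbiased for a random sample,
    but degrades when a few families dominate the sample, which is the
    scenario the paper addresses."""
    copies = sum(genotype.count(allele) for genotype in genotypes)
    return copies / (2 * len(genotypes))

# Hypothetical sample of five diploid genotypes at a biallelic locus.
sample = [("A", "A"), ("A", "a"), ("a", "a"), ("A", "a"), ("A", "A")]
print(allele_frequency(sample, "A"))  # 6 copies / 10 = 0.6
```

When siblings are overrepresented, the gene copies are no longer independent draws from the population, so this ratio can sit far from the true population frequency even for a sizeable sample.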
https://doi.org/10.1142/9781848161092_0034
In this paper, we develop a probabilistic model for two practical aspects of the singular haplotype reconstruction problem: the incompleteness and inconsistency that occur in the DNA sequencing process used to generate the input haplotype fragments, and the common practice used to generate synthetic data in experimental algorithm studies. We design three algorithms in this model that reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provably high probability and in time linear in the size of the input matrix. We also present experimental results that confirm the theoretically efficient performance of these algorithms. The software implementing our algorithms is available for public access and for real-time online demonstration.
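The final step common to fragment-based haplotype reconstruction is a per-column vote: once fragments are partitioned between the two haplotypes, each SNP position is decided by majority among the fragments covering it. The fragments below are invented, and the paper's algorithms additionally choose the partition and handle errors probabilistically:

```python
def reconstruct_haplotypes(fragments, assignment):
    """Rebuild two haplotypes from SNP fragments ('0'/'1', '-' = missing)
    given an assignment of each fragment to haplotype 0 or 1, by
    per-column majority vote within each group."""
    n = len(fragments[0])
    haplotypes = []
    for group in (0, 1):
        hap = []
        for j in range(n):
            votes = [f[j] for f, g in zip(fragments, assignment)
                     if g == group and f[j] != "-"]
            # Ties and uncovered columns default to '0' in this sketch.
            hap.append("1" if votes.count("1") > votes.count("0") else "0")
        haplotypes.append("".join(hap))
    return haplotypes

# Hypothetical fragment matrix consistent with haplotypes 01110 and 10001.
frags = ["01-10", "0111-", "10-01", "1-001", "01110"]
print(reconstruct_haplotypes(frags, [0, 0, 1, 1, 0]))
```

Voting makes the reconstruction robust to missing entries, and with independent errors the majority is correct with high probability once enough fragments cover each column, which is the flavor of guarantee the paper proves.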
https://doi.org/10.1142/9781848161092_0035
Finding motifs and the corresponding binding sites is a critical and challenging problem in studying the process of gene expression. String and matrix representations are two popular models for representing a motif. However, both representations share an important weakness: they assume that the occurrence of a nucleotide in a binding site is independent of the other nucleotides. More complicated representations, such as HMMs or regular expressions, can capture nucleotide dependencies, but these models are impractical: they have too many parameters and require many known binding sites. Recently, Chin and Leung introduced the SPSP representation, which overcomes the limitations of these complicated models. However, discovering novel motifs in the SPSP representation is still an NP-hard problem. In this paper, based on our observations of real binding sites, we propose the Dependency Pattern Sets (DPS) representation, which is simpler than the SPSP model but can still capture nucleotide dependency. We develop a branch-and-bound algorithm (DPS-Finder) for finding optimal DPS motifs. Experimental results show that DPS-Finder can discover a length-10 motif from 22 length-500 DNA sequences within a few minutes, and that the DPS representation performs similarly to the SPSP representation.
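The independence assumption that the matrix representation makes, and that SPSP and DPS relax, is concrete in a position weight matrix scorer: each column contributes to the score on its own, so correlated positions cannot be modeled. The PWM below is a made-up example, not a motif from the paper:

```python
import math

def pwm_score(pwm, site):
    """Log-likelihood of a site under a position weight matrix. Each column
    is scored independently of the others -- precisely the assumption that
    dependency-aware representations such as SPSP and DPS relax."""
    return sum(math.log(pwm[i][base]) for i, base in enumerate(site))

# Hypothetical 3-column PWM (per-position nucleotide probabilities).
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print(pwm_score(pwm, "ATG") > pwm_score(pwm, "GGG"))  # True
```

A PWM cannot express a rule like "position 2 is T only when position 1 is A"; pattern-set representations encode such joint constraints with far fewer parameters than a full HMM.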
https://doi.org/10.1142/9781848161092_0036
Primer Approximation Multiplex PCR (PAMP) is a recently introduced experimental technique for detecting large-scale cancer genome lesions such as inversions and deletions from heterogeneous samples containing a mixture of cancer and normal cells. In this paper we give integer linear programming formulations for the problem of selecting sets of PAMP primers that minimize detection failure probability. We also show that PAMP primer selection for detection of anchored deletions cannot be approximated within a factor of 2 − ɛ, and give a 2-approximation algorithm for a special case of the problem. Experimental results show that our ILP formulations can be used to optimally solve medium size instances of the inversion detection problem, and that heuristics based on iteratively solving ILP formulations for a one-sided version of the problem give near-optimal solutions for anchored deletion detection with highly scalable runtime.
https://doi.org/10.1142/9781848161092_0037
We have developed a generic framework for combining introns from genomically aligned expressed-sequence-tag clusters with a set of exon predictions to produce alternative transcript predictions. Our current implementation uses ASPIC to generate alternative transcripts from EST mappings. Introns from ASPIC and a set of gene predictions from many diverse gene prediction programs are given to the gene prediction combiner GenePC, which then generates alternative consensus splice forms. We evaluated our method on the ENCODE regions of the human genome. In general, we see a marked improvement in transcript-level sensitivity because more than one transcript per gene may now be predicted. GenePC, which alone is highly specific at the transcript level, balances the lower specificity of ASPIC.
https://doi.org/10.1142/9781848161092_0038
High-resolution spatial information on gene expression over time can be acquired through whole-mount in-situ hybridisation experiments in animal model species such as fish, fly, or mouse. Expression patterns of many genes have been studied, and the data have been integrated into dedicated model organism databases such as ZFIN for zebrafish, MEPD for medaka, BDGP for Drosophila, and MGI for mouse. Nevertheless, a central repository that allows users to query and compare gene expression patterns across different species has not yet been established. For this purpose we have integrated gene expression data for zebrafish, medaka, Drosophila, and mouse into a central public repository named 4DXpress (http://ani.embl.de/4DXpress). 4DXpress allows users to query anatomy-ontology-based expression annotations across species and to jump quickly from one gene to its orthologs in other species based on Ensembl Compara relationships. We have set up a linked resource for microarray data at ArrayExpress. In addition, we have mapped developmental stages between the species to enable comparison of corresponding developmental time phases. We have used clustering algorithms to classify genes based on their expression pattern annotations. To illustrate the use of 4DXpress, we systematically analysed the relationships between conserved regulatory inputs and spatio-temporal gene expression derived from 4DXpress and found significant correlation between the expression patterns of genes predicted to have similar regulatory elements in their promoters.
https://doi.org/10.1142/9781848161092_0039
DNA replication is a key process in the cell division cycle and is initiated in a coordinated manner in several species. To understand DNA replication in a species, one needs to measure both the half-replication timing (or replication timing) and the replication efficiency, which vary across the genome in higher eukaryotes. Previous studies did not directly assess replication efficiency on a genomic scale, and replication timing was assessed only indirectly from average DNA content. In this paper, we present the first method for directly and simultaneously measuring both half-replication timing and efficiency from a single DNA microarray time-course data set. We achieve this by fitting a so-called near-sigmoid model to each locus of the DNA. We apply this model to S. pombe DNA replication microarray data and show that it is effective for genome-scale replication timing and efficiency profiling studies.
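Fitting a sigmoid per locus can be sketched with a toy stand-in for the paper's near-sigmoid model: an efficiency-scaled logistic whose midpoint is the half-replication time, fitted here by a simple grid search over synthetic data (the parameter values and grid are illustrative assumptions):

```python
import math

def sigmoid(t, t50, k, eff):
    """Fraction of cells that have replicated a locus by time t:
    an efficiency-scaled logistic with half-replication time t50
    and fixed steepness k."""
    return eff / (1.0 + math.exp(-k * (t - t50)))

def fit_locus(times, values, k=1.0):
    """Grid search for (t50, efficiency) minimizing squared error --
    a toy stand-in for the paper's per-locus model fitting."""
    best = (float("inf"), None, None)
    for t50 in [x / 10 for x in range(0, 201)]:        # 0.0 .. 20.0 min
        for eff in [e / 20 for e in range(1, 21)]:      # 0.05 .. 1.0
            sse = sum((sigmoid(t, t50, k, eff) - v) ** 2
                      for t, v in zip(times, values))
            if sse < best[0]:
                best = (sse, t50, eff)
    return best[1], best[2]

# Synthetic time course generated with t50 = 8 min, efficiency = 0.8.
times = list(range(0, 21, 2))
values = [sigmoid(t, 8.0, 1.0, 0.8) for t in times]
t50, eff = fit_locus(times, values)
print(t50, eff)
```

Because both parameters enter the one fit, timing and efficiency are read off the same curve, which is the sense in which a single time-course data set yields both quantities at once.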
https://doi.org/10.1142/9781848161092_bmatter
AUTHOR INDEX