High-throughput sequencing and functional genomics technologies have given us a draft human genome sequence and have enabled large-scale genotyping and gene expression profiling of human populations. Databases containing large numbers of sequences, polymorphisms, and gene expression profiles of normal and diseased tissues in different clinical states are rapidly being generated for human and model organisms. Bioinformatics is thus rapidly growing in importance in the annotation of genomic sequences, in the understanding of the interplay between genes and proteins, in the analysis of the genetic variability of species, and so on. This proceedings contains an up-to-date exchange of knowledge, ideas, and solutions to conceptual and practical issues of bioinformatics, by researchers, professionals, and industrial practitioners at the 4th Asia-Pacific Bioinformatics Conference held in Taipei in February 2006.
https://doi.org/10.1142/9781860947292_fmatter
PREFACE
APBC2006 ORGANIZATION
CONTENTS
https://doi.org/10.1142/9781860947292_0001
Advances in genomics have led to the production of various functional genomic data as well as genomic sequence data. This is particularly true in yeasts. Such data have proved to be highly useful for inferring regulatory elements and modules. I shall present studies that I have done with my colleagues and collaborators on the following topics. (1) Detection of transcription factors (including their interactions) involved in a specific function such as the cell cycle, (2) inference of the cis elements (binding sites and sequences) of a transcription factor, (3) reconstruction of the regulatory circuits of genes, and (4) inference of regulatory modules. In all these topics, we have developed methods and have applied them to analyze data from yeasts.
https://doi.org/10.1142/9781860947292_0002
Most genes have attained their observed distribution among genomes by transmission from parent to offspring through time. In prokaryotes (bacteria and archaea), however, some genes are where they are as the result of transfer from an unrelated lineage. To elucidate the biological origins and functional consequences of lateral gene transfer (LGT), we have constructed an automated computational pipeline to recognise protein families among prokaryotic genomes, generate high-quality multiple sequence alignments of orthologs, infer statistically sound phylogenetic trees, and find topologically incongruent subtrees (prima facie instances of LGT). This pipeline requires that we automate workflows, design and optimize algorithms, mobilise high-performance computing resources, and efficiently manage federated data. I will summarise results from the automated comparison of 422971 proteins in 22437 families across 144 sequenced prokaryotic genomes, including the nature and extent of LGT among these lineages, major donors and recipients, the biochemical pathways and physiological functions most affected, and implications for the role of LGT in evolution of biochemical pathways.
https://doi.org/10.1142/9781860947292_0003
An innovative new technology, optical mapping, is used to infer the genome map of the location of short sequence patterns called restriction sites. The technology, developed by David Schwartz, allows the visualization of the maps of randomly located single molecules around a million base pairs in length. The genome map is constructed from overlapping these shorter maps. The mathematical and computational challenges come from modeling the measurement errors and from the process of map assembly.
https://doi.org/10.1142/9781860947292_0004
The full sibship reconstruction (FSR) problem is the problem of inferring all groups of full siblings from a given population sample using genetic marker data without parental information. The FSR problem remains a significant challenge for computational biology, since an exact solution for the problem has not been found. The new algorithm, named SIMPSON-assisted Descending Ratio (SDR), is devised combining a new Simpson index based O(n2) algorithm (MS2) and the existing Descending Ratio (DR) algorithm. The SDR algorithm outperforms the SIMPSON, MS2, and DR algorithms in accuracy and robustness when tested on a variety of sample family structures. The accuracy error is measured as the percentage of incorrectly assigned individuals. The robustness of the FSR algorithms is assessed by simulating a 2% mutation rate per locus (a 1% rate per allele).
https://doi.org/10.1142/9781860947292_0005
Recent developments in large-scale monitoring of gene expression such as DNA microarrays have made the reconstruction of gene regulatory networks (GRNs) feasible. Before one can infer the structures of these networks, it is important to identify, for each gene in the network, which genes can affect its expression and how they affect it. Most of the existing approaches are useful exploratory tools in the sense that they allow the user to generate biological hypotheses about transcriptional regulations of genes that can then be tested in the laboratory. However, the patterns discovered by these approaches are not adequate for making accurate prediction on gene expression patterns in new or held-out experiments. Therefore, it is difficult to compare performance of different approaches or decide which approach is likely to generate plausible hypothesis. For this reason, we need an approach that not only can provide interpretable insight into the structures of GRNs but also can provide accurate prediction. In this paper, we present a novel fuzzy logic-based approach for this problem. The desired characteristics of the proposed algorithm are as follows: (i) it is able to directly mine the high-dimensional expression data without the need for additional feature selection procedures, (ii) it is able to distinguish between relevant and irrelevant expression data in predicting the expression patterns of predicted genes, (iii) based on the proposed objective interestingness measure, no user-specified thresholds are needed in advance, (iv) it can make explicit hidden patterns discovered for possible biological interpretation, (v) the discovered patterns can be used to predict gene expression patterns in other unseen tissue samples, and (vi) with fuzzy logic, it is robust to noise in the expression data as it hides the boundaries of the adjacent intervals of the quantitative attributes. Experimental results on real expression data show that it can be very effective and the discovered patterns reveal biologically meaningful regulatory relationships of genes that could help the user reconstructing the underlying structures of GRNs.
https://doi.org/10.1142/9781860947292_0006
The circadian regulatory network is one of the main topics of plant investigations. The intracellular interactions among genes in response to the environmental stimuli of light are related to the foundation of functional genomics in plant. However, the sensitivity analysis of the circadian system has not analyzed by perturbed stochastic dynamic model via microarray data in plant. In this study, the circadian network is constructed for Arabidopsis thaliana using a stochastic dynamic model with sigmoid interaction, activation delay, and regulation of input light taken into consideration. The describing function method in nonlinear control theory about nonlinear limit cycle (oscillation) is employed to interpret the oscillations of the circadian regulatory networks from the viewpoint that nonlinear network will continue to oscillate if its feedback loop gain is equal to 1 to support the oscillation of circadian network. Based on the dynamic model via microarray data, the system sensitivity analysis is performed to assess the robustness of circadian regulatory network via biological perturbations. We found that the circadian network is more sensitive to the perturbation of the trans-expression threshold, is more sensitive to the activation level of steady state, rather than the trans-sensitivity rate.
https://doi.org/10.1142/9781860947292_0007
We present a new program for predicting protein subcellular localization from amino acid sequence. WoLF PSORT is a major update to the PSORTII program, based on new sequence data and incorporating new features with a feature selection procedure. Following SWISS-PROT, we divided eukaryotes into three groups: fungi, plant, and animal. For the 2113 fungi proteins divided into 14 categories; we found that, combined with BLAST, WoLF PSORT yields a cross-validated accuracy of 83%, eliminating about 1/3 of the errors made when using BLAST alone. For 12771 animal proteins a combined accuracy of 95.6% is obtained, eliminating 1/4 of BLAST alone errors, and for 2333 plant proteins the accuracy can be improved to 86% from 84%.
https://doi.org/10.1142/9781860947292_0008
Protein tertiary structures are known to have significant correlations with their biological functions. To understand the information of the protein structures, Structural Classification of Protein (SCOP) Database, which is manually constructed by human experts, classifies similar protein folds in the same domain hierarchy. Even though this approach is believed to be more reliable than applying traditional alignment methods in structural classifications, it is labor intensive. In this paper, we build a non-parametric classifier to predict possible SCOP domains for unknown protein structures. With supervised learning, the algorithm first maps tertiary structures of training proteins into two-dimensional distance matrices, and then extracts signatures from visual contents of matrices. A knowledge discovery and data mining (KDD) process further discovers relevant patterns in training signatures of each SCOP domain by mining association rules. Finally, the quantity of rules whose patterns match signatures of unknown proteins determines predicted domains in a ranked order. We select 7,702 protein chains from 150 domains of SCOP database 1.67 release as labelled data using 10 fold cross validation. Experimental results show that the prediction accuracy is 91.27% for the top ranked domain and 99.22% for the top 5 ranked domains. The average response time takes 6.34 seconds, exhibiting reasonably high prediction accuracy and efficiency.
https://doi.org/10.1142/9781860947292_0009
The central role phylogeny plays in biology and its pervasiveness in comparative genomics studies have led researchers to develop a plethora of methods for its accurate reconstruction. Most phylogeny reconstruction methods, though, assume a single tree underlying a given sequence alignment. While a good first approximation in many cases, a tree may not always model the evolutionary history of a set of organisms. When events such as interspecific recombination occur, different regions in the alignment may have different underlying trees. Accurate reconstruction of the evolutionary history of a set of sequences requires recombination detection, followed by separate analyses of the non-recombining regions. Besides aiding accurate phylogenetic analyses, detecting recombination helps in understanding one of the main mechanisms of bacterial genome diversification. In this paper, we introduce RECOMP, an accurate and fast method for detecting recombination events in a sequence alignment. The method slides a fixed-width window across the alignment and determines the presence of recombination events based on a combination of topology and parsimony score differences in neighboring windows. On several synthetic and biological datasets, our method performs much faster than existing tools with accuracy comparable to the best available method.
https://doi.org/10.1142/9781860947292_0010
One of the main issues in comparative genomics is the study of chromosomal gene order in one or more related species. Recently identifying sets of orthologous genes in several genomes has become getting important, since a cluster of similar genes helps us to predict the function of unknown genes. For this purpose, the whole genome alignment is usually used to determine horizontal gene transfer, gene duplication, and gene loss between two related genomes. Also it is well known that a novel visualization tool of the whole genome alignment would be very useful for understanding genome organization and the evolutionary process. In this paper, we propose a method for identifying and visualizing the alignment of the whole genome alignment, especially for detecting gene clusters between two aligned genomes. Since the current rigorous algorithm for finding gene clusters has strong and artificial constraints, they are not useful for coping with “noisy” alignments. We developed the system AlignScope to provide a simplified structure for genome alignment at any level, and also to help us to find gene clusters. In this experiment, we have tested AlignScope on several microbial genomes.
https://doi.org/10.1142/9781860947292_0011
Finding common patterns, motifs, in a set of DNA sequences is an important problem in bioinformatics. One common representation of motifs is a string with symbols A, C, G, T and N where N stands for the wildcard symbol. In this paper, we introduce a more general motif discovery problem without any weaknesses of the Planted (l,d)-Motif Problem and also a set of control sequences as an additional input. The existing algorithms using brute force approach for solving similar problem take O(n(t+f)l5l) times where t and f are the number of input sequences and control sequences respectively, n is the length of each sequence and l is the length of the motif. We propose an efficient algorithm, called VAS, which has an expected running time O(nfl(nt)k(4k−1+1/4k−1)l) using O((nt)k(4k−1+1/4k−1)l) space for any integer k. In particular when k = 3, the time and space complexities are O(nlf(nt)3(1.0625)l) and O((nt)3(1.0625)l) respectively. This algorithm makes use of voting and graph representation for better time and space complexities. This technique can also be used to improve the performances of some existing algorithms.
https://doi.org/10.1142/9781860947292_0012
The interaction between transcription factors and their DNA binding sites plays a key role for understanding gene regulation mechanisms. Recent studies revealed the presence of “functional polymorphism” in genes that is defined as regulatory variation measured in transcription levels due to the cis-acting sequence differences. These regulatory variants are assumed to contribute to modulating gene functions. However, computational identifications of such functional cis-regulatory variants is a much greater challenge than just identifying consensus sequences, because cis-regulatory variants differ by only a few bases from the main consensus sequences, while they have important consequences for organismal phenotype. None of the previous studies have directly addressed this problem. We propose a novel discriminative detection method for precisely identifying transcription factor binding sites and their functional variants from both positive and negative samples (sets of upstream sequences of both bound and unbound genes by a transcription factor) based on the genome-wide location data. Our goal is to find such discriminative substrings that best explain the location data in the sense that the substrings precisely discriminate the positive samples from the negative ones rather than finding the substrings that are simply over-represented among the positive ones. Our method consists of two steps: First, we apply a decision tree learning method to discover discriminative substrings and a hierarchical relationship among them. Second, we extract a main motif and further a second motif as a cis-regulatory variant by utilizing functional annotations. Our genome-wide experimental results on yeast Saccharomyces cerevisiae show that our method presented significantly better performances for detecting experimentally verified consensus sequences than current motif detecting methods. In addition, our method has successfully discovered second motifs of putative functional cis-regulatory variants which are associated with genes of different functional annotations, and the correctness of those variants have been verified by expression profile analyses.
https://doi.org/10.1142/9781860947292_0013
This paper considers a problem of finding control strategies for Boolean networks, where Boolean networks have been used as a model of genetic networks. This paper shows that finding a control strategy leading to the desired global state is NP-hard even if there is only one control node in the network. This result justifies existing exponential time algorithms for finding control strategies for probabilistic Boolean networks. On the other hand, this paper shows that the problem can be solved in polynomial time if the network has a tree structure.
https://doi.org/10.1142/9781860947292_0014
Sequencing of peptide sequences using tandem mass spectrometry data is an important and challenging problem in proteomics. In this paper, we address the problem of peptide sequencing for multi-charge spectra. Most peptide sequencing algorithms currently handle spectra of charge 1 or 2 and have not been designed to handle higher-charge spectra. We give a characterization of multi-charge spectra by generalizing existing models. Using these new models, we have analyzed spectra with charges 1-5 from the GPM [8] datasets. Our analysis shows that higher charge peaks are present and they contribute significantly to prediction of the complete peptide. They also help to explain why existing algorithms do not perform well on multi-charge spectra. We also propose a new de novo algorithm for dealing with multi-charge spectra based on the new models. Experimental results show that it performs well on all spectra, especially so for multi-charge spectra.
https://doi.org/10.1142/9781860947292_0015
Finding motifs in DNA sequences plays an important role in deciphering transcriptional regulatory mechanisms and drug target identification. In this paper, we propose an efficient algorithm, EDAM, for finding motifs based on frequency transformation and Minimum Bounding Rectangle (MBR) techniques. It works in three phases,frequency transformation, MBR-clique searching and motif discovery. In frequency transformation, EDAM divides the sample sequences into a set of substrings by sliding windows, then transforms them to frequency vectors which are stored in MBRs. In MBR-clique searching, based on the frequency distance theorems EDAM searches for MBR-cliques used for motif discovery. In motif discovery, EDAM discovers larger cliques by extending smaller cliques with their neighbors. To accelerate the clique discovery, we propose a range query facility to avoid unnecessary computations for clique extension. The experimental results illustrate that EDAM well solves the running time bottleneck of the motif discovery problem in large DNA database.
https://doi.org/10.1142/9781860947292_0016
Multiple loci analysis has become popular with the advanced development in biological experiments. A lot of studies have been focused on the biological and the statistical properties of such multiple loci analysis. In this paper, we study one of the important computational problems: solving the probabilities of haplotype classes from a large linear system Ax = b derived from the recombination events in multiple loci analysis. Since the size of the recombination matrix A increases exponentially with respect to the number of loci, fast solvers are required to deal with a large number of loci in the analysis. By exploiting the nice structure of the matrix A, we develop an efficient recursive algorithm for solving such structured linear systems. In particular, the complexity of the proposed algorithm is of O(m log m) operations and the memory requirement is of O(m) locations where m is the size of the matrix A. Numerical examples are given to demonstrate the effectiveness of our efficient solver.
https://doi.org/10.1142/9781860947292_0017
The factors governing codon and amino acid usages in the predicted protein-coding sequences of Tropheryma whipplei TW08/27 and Twist genomes have been analyzed. Multivariate analysis identifies the replicational-transcriptional selection coupled with DNA strand-specific asymmetric mutational bias as a major driving force behind the significant inter-strand variations in synonymous codon usage patterns in T. whipplei genes, while a residual intra-strand synonymous codon bias is imparted by a selection force operating at the level of translation. The strand-specific mutational pressure has little influence on the amino acid usage, for which the mean hydropathy level and aromaticity are the major sources of variation, both having nearly equal impact. In spite of the intracellular life-style, the amino acid usage in highly expressed gene products of T. whipplei follows the cost-minimization hypothesis. Both the genomes under study are characterized by the presence of two distinct groups of membrane-associated genes, products of which exhibit significant differences in primary and potential secondary structures as well as in the propensity of protein disorder.
https://doi.org/10.1142/9781860947292_0018
Trends in synonymous codon usage in adenoviruses have been examined through the multivariate statistical analysis on the annotated protein-coding regions of 22 adenoviral species, for which complete genome sequences are available. One of the major determinants of such trends is the G + C content at third codon positions of the genes, the average value of which varied from one viral genome to other depending on the overall mutational bias of the species. G3S and C3S interacted synergistically along the first principal axis of Correspondence analysis on the Relative Synonymous Codon Usage of adenoviral genes, but antagonistically along the second principal axis. Other major determinants of the trends are the natural selection, putatively operative at the level of translation and quite interestingly, hydropathy of the encoded proteins. The trends in codon usage, though characterized by distinct virus-specific mutational bias, do not exhibit any sign of host-specificity. Significant variations are observed in synonymous codon choice in structural and nonstructural genes of adenoviruses.
https://doi.org/10.1142/9781860947292_0019
Microarray gene expression data often contains missing values resulted from various reasons. However, most of the gene expression data analysis algorithms, such as clustering, classification and network design, require complete information, that is, without any missing values. It is therefore very important to accurately impute the missing values before applying the data analysis algorithms. In this paper, an Iterated Local Least Squares Imputation method (ILLsimpute) is proposed to estimate the missing values. In ILLsimpute, a similarity threshold is learned using known expression values and at every iteration it is used to obtain a set of coherent genes for every target gene containing missing values. The target gene is then represented as a linear combination of the coherent genes, using the least squares. The algorithm terminates after certain iterations or when the imputation converges. The experimental results on real microarray datasets show that ILLsimpute outperforms three most recent methods on several commonly tested datasets.
https://doi.org/10.1142/9781860947292_0020
Multiple sequence alignments can provide information for comparative analyses of proteins and protein populations. We present some statistical trend-tests that can be used when an aligned data set can be divided into two or more populations based on phenotypic traits such as preference of temperature, pH, salt concentration or pressure. The approach is based on estimation and analysis of the variation between the values of physicochemical parameters at positions of the sequence alignment. Monotonic trends are detected by applying a cumulative Mann-Kendall test. The method is found to be useful to identify significant physicochemical mechanisms behind adaptation to extreme environments and uncover molecular differences between mesophile and extremophile organisms. A filtering technique is also presented to visualize the underlying structure in the data. All the comparative statistical methods are available in the toolbox DeltaProt.
https://doi.org/10.1142/9781860947292_0021
Multiclass cancer classification based on microarray data is described. A generalized output-coding scheme combined with binary classifiers is used. Different coding strategies, decoding functions and feature selection methods are combined and validated on two cancer datasets: GCM and ALL. The effects of these different methods and their combinations are then discussed. The highest testing accuracies achieved are 78% and 100% for the two datasets respectively. The results are considered to be very good when compared with the other researchers’ work.
https://doi.org/10.1142/9781860947292_0022
The inference of evolutionary relationships is usually aided by a reconstruction method which is expected to produce a reasonably accurate estimation of the true evolutionary history. However, various factors are known to impede the reconstruction process and result in inaccurate estimates of the true evolutionary relationships. Detecting and removing errors (wrong branches) from tree estimates bear great significance on the results of phylogenetic analyses. Methods have been devised for assessing the support of (or confidence in) phylogenetic tree branches, which is one way of quantifying inaccuracies in trees. In this paper, we study, via simulations, the performance of the most commonly used methods for assessing branch support: bootstrap of maximum likelihood and maximum parsimony trees, consensus of maximum parsimony trees, and consensus of Bayesian inference trees. Under the conditions of our experiments, our findings indicate that the actual amount of change along a branch does not have strong impact on the support of that branch. Further, we find that bootstrap and Bayesian estimates are generally comparable to each other, and superior to a consensus of maximum parsimony trees. In our opinion, the most significant finding of all is that there is no threshold value for any of the methods that would allow for the elimination of wrong branches while maintaining all correct ones—there are always weakly supported true positive branches.
https://doi.org/10.1142/9781860947292_0023
The rapid growth of biological databases not only provides biologists with abundant data but also presents a big challenge in relation to the analysis of data. Many data analysis approaches such as data mining, information retrieval and machine learning have been used to extract frequent patterns from diverse biological databases. However, the discrepancies, due to the differences in the structure of databases and their terminologies, result in a significant lack of interoperability. Although ontology-based approaches have been used to integrate biological databases, the inconsistent analysis of biological databases has been greatly disregarded. This paper presents a method by which to measure the degree of inconsistency between biological databases. It not only presents a guideline for correct and efficient database integration, but also exposes high quality data for data mining and knowledge discovery.
https://doi.org/10.1142/9781860947292_0024
We address the issue of structured motif inference. This problem is stated as follows: given a set of n DNA sequences and a quorum q (%), find the optimal structured consensus motif described as gaps alternating with specific regions and shared by at least q × n sequences. Our proposal is in the domain of metaheuristics: it runs solutions to convergence through a cooperation between a sampling strategy of the search space and a quick detection of local similarities in small sequence samples. The contributions of this paper are: (1) the design of a stochastic method whose genuine novelty rests on driving the search with a threshold frequency f discrimining between specific regions and gaps; (2) the original way for justifying the operations especially designed; (3) the implementation of a mining tool well adapted to biologists’ exigencies: few input parameters are required (quorum q, minimal threshold frequency f, maximal gap length g). Our approach proves efficient on simulated data, promoter sites in Dicot plants and transcription factor binding sites in E. coli genome. Our algorithm, Kaos, compares favorably with MEME and STARS in terms of accuracy.
https://doi.org/10.1142/9781860947292_0025
We present a randomized algorithm for semi-supervised learning of Mahalanobis metrics over ℝn. The inputs to the algorithm are a set, U, of unlabeled points in ℝn, a set of pairs of points, S = {(x, y)i}; x, y ∈ U, that are known to be similar, and a set of pairs of points, D = {(x, y)i}; x, y ∈ U, that are known to be dissimilar. The algorithm randomly samples S, D, and m-dimensional subspaces of ℝn and learns a metric for each subspace. The metric over ℝn is a linear combination of the subspace metrics. The randomization addresses issues of efficiency and overfitting. Extensions of the algorithm to learning non-linear metrics via kernels, and as a pre-processing step for dimensionality reduction are discussed. The new method is demonstrated on a regression problem (structure-based chemical shift prediction) and a classification problem (predicting clinical outcomes for immunomodulatory strategies for treating severe sepsis).
https://doi.org/10.1142/9781860947292_0026
Sequence-dependence of DNA conformation plays an essential role in the protein-DNA recognition process during the regulation of gene expression. Proteins recognize specific DNA sequences not only directly through contact between bases and amino acids, but also indirectly through sequence-dependent conformation of DNA. To test to what extent the DNA sequence defines the DNA structure we analyzed the conformational space of all unique tetranucleotides. The large quantity of data needed for this study was obtained by carrying out molecular dynamics simulations of dodecamer B-DNA structures. Separate simulations were performed for each of the possible 136 unique tetranucleotides at the dodecamer centers and the simulated trajectories were transformed into the DNA conformational space. This allowed us to explain the multimodal conformational state of some dinucleotides as aggregations of tetranucleotide conformational states that have such a dinucleotide inside their center. We proposed simple models to express in a linear way how the different bases that embrace a central dinucleotide perturb its conformational state, emphasizing how the conformational role of each base depends on its relative position (left, central, right) in the final tetranucleotide, and how the same peripherical base plays a different role depending on which is the central dinucleotide. These models allow us to establish an index to quantify the degree of context-dependence, observing an increasing context-dependence from the average base-pair step conformations AA/TT, CG, AC/GT (context-independent), AG/CT, AT, GC, GG/CC (weakly context-dependent), and GA/TC, CA/TG, TA (context-dependent).
https://doi.org/10.1142/9781860947292_0027
The prediction of β-turn, despite the observation that one out of four residues in protein belongs to this structure element, has attracted considerably less attention comparing to secondary structure predictions. Neural network machine learning is a popular approach to address such a problem of structural bioinformatics. In this paper, we describe a new neural network model for β-turn prediction that accounts for site-specific amino acid preference, a property ignored in previous training models. We showed that the statistics of amino acid preference at specific sites within and around a β-turn is rather significant, and incorporation of this property helps improve the network performance. Furthermore, by contrasting with a previous model, we revealed a deficiency of not incorporating this site-specific property in previous models.
https://doi.org/10.1142/9781860947292_0028
Transcription regulation is mediated by combinatorial interactions between diverse trans-acting proteins and arrays of cis-regulatory sequences. Revealing this complex interplay between transcription factors and binding sites remains a fundamental problem for understanding the flow of genetic information. The oPOSSUM analysis system facilitates the interpretation of gene expression data through the analysis of transcription factor binding sites shared by sets of co-expressed genes. The system is based on cross-species sequence comparisons for phylogenetic footprinting and motif models for binding site prediction. We introduce a new set of analysis algorithms for the study of the combinatorial properties of transcription factor binding sites shared by sets of co-expressed genes. The new methods circumvent computational challenges through an applied focus on families of transcription factors with similar binding properties. The algorithm accurately identifies combinations of binding sites over-represented in reference collections and clarifies the results obtained by existing methods for the study of isolated binding sites.
https://doi.org/10.1142/9781860947292_0029
Local structure prediction can facilitate ab initio structure prediction, protein threading, and remote homology detection. However, previous approaches to local structure prediction suffer from poor accuracy. In this paper, we propose a knowledge-based prediction method that assigns a measure called the local match rate to each position of an amino acid sequence to estimate the confidence of our approach. To remedy prediction results with low local match rates, we use a neural network prediction method. Then, we have a hybrid prediction method, HYPLOSP (HYbrid method to Protein LOcal Structure Prediction) that combines our knowledge-based method with a neural network method. We test the method on two different structural alphabets and evaluate it by QN, which is similar to Q3 in secondary structure prediction. The experimental results show that our method yields a significant improvement over previous studies.
https://doi.org/10.1142/9781860947292_0030
MiRNAs are short non-coding RNAs that regulate gene expression. While the first miRNAs were discovered using experimental methods, experimental miRNA identification remains technically challenging and incomplete. This calls for the development of computational approaches to complement experimental approaches to miRNA gene identification. We propose in this paper a de novo miRNA precursor prediction method. This method follows the “feature generation, feature selection, and feature integration” paradigm of constructing recognition models for genomics sequences. We generate and identified features based on information in both primary sequence and secondary structure, and use these features to construct SVM-based models for the recognition of miRNA precursors. Experimental results show that our method is effective, and can achieve good sensitivity and specificity.
https://doi.org/10.1142/9781860947292_0031
Genome-wide computational analysis for small nuclear RNA (snRNA) genes resulted in identification of 76 and 73 putative snRNA genes from indica and japonica rice genomes, respectively. We used the basic criteria of a minimum of 70 % sequence identity to the plant snRNA gene used for genome search, presence of conserved promoter elements: TATA box, USE motif and monocot promoter specific elements (MSPs) and extensive sequence alignment to rice / plant expressed sequence tags to denote predicted sequence as snRNA genes. Comparative sequence analysis with snRNA genes from other organisms and predicted secondary structures showed that there is overall conservation of snRNA sequence and structure with plant specific features (presence of TATA box in both polymerase II and III transcribed genes, location of USE motif upstream to the TATA box at fixed but different distance in polymerase II and polymerase III transcribed snRNA genes) and the presence of multiple monocot specific MSPs upstream to the USE motif. Detailed analysis results including all multiple sequence alignments, sequence logos, secondary structures, sequences etc are available at http://kulab.org
https://doi.org/10.1142/9781860947292_0032
The gene tree and species tree problem remains a central problem in phylogenomics. To overcome this problem, gene concatenation approaches have been used to combine a certain number of genes randomly from a set of widely distributed orthologous genes selected from genome data to conduct phylogenetic analysis. The random concatenation mechanism prevents us from the further investigations of the inner structures of the gene data set employed to infer the phylogenetic trees and locates the most phylogenetically informative genes. In this work, a phylogenomic mining approach is described to gain knowledge from a gene data set by clustering genes in the gene set through a self-organizing map (SOM) to explore the gene dataset inner structures. From this, the most phylogenetically informative gene set is created by picking the maximum entropy gene from each cluster to infer phylogenetic trees by phylogenetic analysis. Using the same data set, the phylogenetic mining approach performs better than the random gene concatenation approach.
https://doi.org/10.1142/9781860947292_0033
In this paper, we give a complete characterization of the existence of a galled-tree network in the form of simple sufficient and necessary conditions. As a by-product we obtain as simple algorithm for constructing galled-tree networks. We also introduce a new necessary condition for the existence of a galled-tree network similar to bi-convexity.
https://doi.org/10.1142/9781860947292_0034
The analysis of time series data is of capital importance for pharmacogenomics since the experimental evaluations are usually based on observations of time dependent reactions or behaviors of organisms. Thus, data mining in time series databases is an important instrument towards understanding the effects of drugs on individuals. However, the complex nature of time series poses a big challenge for effective and efficient data mining. In this paper, we focus on the detection of temporal dependencies between different time series: we introduce the novel analysis concept of threshold queries and its semi-supervised extension which supports the parameter setting by applying training datasets. Basically, threshold queries report those time series exceeding an user-defined query threshold at certain time frames. For semi-supervised threshold queries the corresponding threshold is automatically adjusted to the characteristics of the data set, the training dataset, respectively. In order to support threshold queries efficiently, we present a new efficient access method which uses the fact that only partial information of the time series is required at query time. In an extensive experimental evaluation we demonstrate the performance of our solution and show that semi-supervised threshold queries applied to gene expression data are very worthwhile.
https://doi.org/10.1142/9781860947292_0035
Protein nuclear magnetic resonance (NMR) chemical shifts are among the most accurately measurable spectroscopic parameters and are closely correlated to protein structure because of their dependence on the local electronic environment. The precise nature of this correlation remains largely unknown. Accurate prediction of chemical shifts from existing structures’ atomic co-ordinates will permit close study of this relationship. This paper presents a novel non-linear regression based approach to chemical shift prediction from protein structure. The regression model employed combines quantum, classical and empirical variables and provides statistically significant improved prediction accuracy over existing chemical shift predictors, across protein backbone atom types. The results presented here were obtained using the Random Forest regression algorithm on a protein entry data set derived from the RefDB re-referenced chemical shift database.
https://doi.org/10.1142/9781860947292_0036
Automated discovery and extraction of biological relations from online documents, particularly MEDLINE texts, has become essential and urgent because such literature data are accumulated in a tremendous growth. In this paper, we present an ontology-based framework of biological relation extraction system. This framework is unified and able to extract several kinds of relations such as gene-disease, gene-gene, and protein-protein interactions etc. The main contributions of this paper are that we propose a two-level pattern learning algorithm, and organize patterns hierarchically.
https://doi.org/10.1142/9781860947292_0037
To reconstruct a phylogeny for a given set of species, most of the previous approaches are based on the similarity information derived from a subset of conserved regions (or genes) in the corresponding genomes. In some cases, the regions chosen may not reflect the evolutionary history of the species and may be too restricted to differentiate the species. It is generally believed that the inference could be more accurate if whole genomes are being considered. The best existing solution that makes use of complete genomes was proposed by Henz et al.13 They can construct a phylogeny for 91 prokaryotic genomes in 170 CPU hours with an accuracy of about 70% (based on the measurement of non-trivial splits) while other approaches that use whole genomes can only deal with no more than 20 species. Note that Henz et al. measure the distance between the species using BLASTN which is not primarily designed for whole genome alignment Also, their approach is not scalable, for example, it probably takes over 1000 CPU hours to construct a phylogeny for all 230 prokaryotic genomes published by NCBI. In addition, we found that non-trivial splits is only a rough indicator of the accuracy of the phylogeny. In this paper, we propose the followings.
(1) To evaluate the quality of a phylogeny with respect to a model answer, we suggest to use the concept of the maximum agreement subtree as it can capture the structure of the phylogeny.
(2) We propose to use whole genome alignment software (such as MUMmer) to measure the distances between the species and derive an efficient approach to generate these distances.
From the experiments on real data sets, we found that our approach is more accurate and more scalable than Henz et al.’s approach. We can construct a phylogenetic tree for the same set of 91 genomes with an accuracy more than 20% higher (with respect to both evaluation measures) in 2 CPU hours (more than 80 times faster than their approach). Also, our approach is scalable and can construct a phylogeny for 230 prokaryotic genomes with accuracy as high as 85% in only 9.5 CPU hours.
https://doi.org/10.1142/9781860947292_0038
Clustering is widely used in gene expression analysis, which helps to group genes with similar biological function together. The traditional clustering techniques are not suitable to be directly applied to gene expression time series data, because of the inhered properties of local regulation and time shift. In order to cope with the existing problems, the local similarity and time shift, we have developed a new similarity measurement technique called Local Similarity Combination in this paper. And at last, we’ll run our method on the real gene expression data and show that it works well.
https://doi.org/10.1142/9781860947292_bmatter
AUTHOR INDEX