High-throughput sequencing and functional genomics technologies have given us a draft human genome sequence and have enabled large-scale genotyping and gene expression profiling of human populations. Databases containing large number of sequences, polymorphisms, and gene expression profiles of normal and diseased tissues in different clinical states are rapidly being generated for human and model organisms. Bioinformatics is thus rapidly growing in importance in the annotation of genomic sequences, in the understanding of the interplay between genes and proteins, in the analysis the genetic variability of species, etc. The 3rd APBC brings together researchers, professionals, and industrial practitioners for interaction and exchange of knowledge and ideas. The proceedings contains the latest results that address conceptual and practical issues of bioinformatics.
Papers presented at APBC'05 and included in this proceedings volume span the following: Novel Applications in Bioinformatics, Computational Analysis of Biological Data, Data Mining & Statistical Modeling of Biological Data, Modeling and Simulation of Biological Processes, Visualization of Biological Processes and Data, Management, Migration, and Integration of Biological Databases, Access, Indexing, and Search in Biological Databases.
https://doi.org/10.1142/9781860947322_fmatter
PREFACE
APBC2005 ORGANIZATION
CONTENTS
https://doi.org/10.1142/9781860947322_0001
Recent work on whole genome alignment has resulted in efficient tools to locate (possibly) conserved regions of two genomic sequences. Most of such tools start with locating a set of short and highly similar substrings (called anchors) that are present in both genomes. These anchors provide clues for the conserved regions, and the effectiveness of the tools is highly related to the quality of the anchors. Some popular software tools use the exact match maximal unique substrings (EM-MUM) as anchors. However, the result is not satisfactory especially for genomes with high mutation rates (e.g. virus). In our experiments, we found that more than 40% of the conserved genes are not recovered. In this paper, we consider anchors with mismatches. Our contributions include the following.
• Based on the experiments on 35 pairs of virus genomes using three software tools (MUMmer-3, MaxMinCluster, MSS), we show that using anchors with mismatches does increase the effectiveness of locating conserved regions (about 10% more conserved gene regions are located, while maintaining a high sensitivity).
• To generate a more comprehensive set of anchors with mismatches is not trivial for long sequences due to the time and memory limitation. We propose two practical algorithms for generating this anchor set. One aims at speeding up the process, the other aims at saving memory. Experimental results show that both algorithms are faster (6 times and 5 times, respectively) than a straightforward suffix tree based approach.
https://doi.org/10.1142/9781860947322_0002
SVM-Pairwise was a major breakthrough in remote homology detection techniques, significantly outperforming previous approaches. This approach has been extensively evaluated and cited by later works, and is frequently taken as a benchmark. No known work however, has examined the gap penalty model employed by SVM-Pairwise. In this paper, we study in depth the relevance and effectiveness of SVM-Pairwise’s gap penalty model with respect to the homology detection task. We have identified some limitations in this model that prevented the SVM-Pairwise algorithm from realizing its full potential and also studied several ways to overcome them. We discovered a more appropriate gap penalty model that significantly improves the performance of SVM-Pairwise.
https://doi.org/10.1142/9781860947322_0003
For determining functionality dependencies between two proteins, both represented as 3D structures, it is an essential condition that they have a matching structure. As 3D structures for proteins are large, complex and constantly evolving, it is very time-consuming to identify possible locations and sizes of such a matching structure for a given protein against a large protein database. In this paper, we introduce a novel representation model and apply a transformation and formalization to this problem. We then propose a database solution by using innovative high dimensional indexing mechanisms. Experimental results demonstrate a promising performance of the high dimensional indexing to this biologically critical but previously computationally prohibitive problem.
https://doi.org/10.1142/9781860947322_0004
Effective management of data is a major issue in genomics. Genomics data consists of highly inter-related information about genes, proteins, patterns, classifications, and interactions. Graph databases, which model graphs or networks of data, seem a natural way to manage genomics data. Graph databases can support visualization of both queries and results by drawing on the broader field of graph visualization, and the visual paradigm is natural for scientists working in genomics. We have constructed a graph database system that supports visual queries and visualization of query result sets. Our system uses Java, XML, C++, CORAL, and MySQL to implement the GraphLog language. We have validated its applicability to genomics data through a case study, and have done initial studies on the systems performance and the effect of several optimization strategies. We describe the system and its application to genomics.
https://doi.org/10.1142/9781860947322_0005
Numerous high-throughput genomics assays require the amplification of a large number of genomic loci of interest. Amplification is cost-effectively achieved using several short single-stranded DNA sequences called primers and polymerase enzyme in a reaction called multiplex polymerase chain reaction (MP-PCR). Amplification of each locus requires that two of the primers bind to the forward and reverse DNA strands flanking the locus. Since the efficiency of PCR amplification falls off exponentially as the length of the amplification product increases, an important practical requirement is that the distance between the binding sites of the two primers should not exceed a certain threshold, In this paper we study MP-PCR primer set selection with amplification length constraints from both theoretical and practical perspectives. Our contributions include an improved analysis of a simple yet effective greedy algorithm for the problem, and a comprehensive experimental study comparing our greedy algorithm with other published heuristics on both synthetic and genomic database test cases.
https://doi.org/10.1142/9781860947322_0006
Protein threading with profiles in which constraints on distances between residues are given is known to be NP-hard. Moreover, a simple algorithm known as CLIQUETHREAD based on efficient reduction to maximum edge-weight clique finding problem has been known to be a practical algorithm for solving the protein threading problem with profiles and constraints. This algorithm is not efficient enough to be applicable to large scale threading prediction. Besides, the algorithm was only presented for profile threading with strict constraints. This paper presents a more efficient algorithm FTHREAD for profile threading with strict constraints which is more than 18 times faster than CLIQUETHREAD for larger proteins. Moreover, we also present a novel practical algorithm NTHREAD for profile threading with non-strict constraints. The comparison of FTHREAD with existing state-of-the-art methods shows that although our algorithm uses a simple threading function, our algorithm performs equally well as these existing methods for protein threading. Besides, our computational experiments for sequence-structure alignments for a number of proteins have shown better results for non-strict constraints threading than protein threading with strict constraints. We have also analyzed the effects of using a number of distance constraints.
https://doi.org/10.1142/9781860947322_0007
Fragment-based analysis of protein three-dimensional (3D) structures has received increased attention in recent years. Here, we used a set of pentamer local structure alphabets (LSAs) recently derived in our laboratory to represent protein structures, i.e. we transformed the 3D structures into one-dimensional (1D) sequences of LSAs. We then applied Hidden Markov Model training to these LSA sequences to assess their ability to capture features characteristic of 43 populated protein folds. In the size range of LSAs examined (5 to 41 alphabets), the performance was optimal using 20 alphabets, giving an accuracy of fold classification of 82% in a 5-fold cross-validation on training-set structures sharing < 40% pairwise sequence identity at the amino acid level. For test-set structures, the accuracy was as high as for the training set, but fell to 65% for those sharing no more than 25% amino acid sequence identity with the training-set structures. These results suggest that sufficient 3D information can be retained during the drastic 3D->1D transformation for use as a framework for developing efficient and useful structural bioinformatics tools.
https://doi.org/10.1142/9781860947322_0008
Consensus-based protein structure prediction methods have been proved to be successful in recent CASPs (Critical Assessment of Structure Prediction). By combining several weaker individual servers, a meta server tends to generate better predictions than any individual server. In this paper, we present a Support Vector Machines (SVM) regression-based consensus method for protein fold recognition, which is a key component for high-throughput protein structure prediction and protein functional annotation. Our SVM model extracts the features from a predicted structural model by comparing it to other models generated by all the individual servers and then predicts the quality of this model. Experimental results on several LiveBench data sets show that our consensus method consistently performs better than individual servers. Based on this approach, we have developed a meta server, Alignment by Consensus Estimator (ACE), which is participating in CASP6 and CAFASP4 (Fourth Critical Assessment of Fully Automated Structure Prediction). ACE is available at http://www.cs.uwaterloo.ca/~13yu/consensus.htm.
https://doi.org/10.1142/9781860947322_0009
We introduce a new approach for predicting the secondary structure of proteins using profiles and the Fuzzy K-Nearest Neighbor algorithm. K-Nearest Neighbor methods give relatively better performance than Neural Networks or Hidden Markov models when the query protein has few homologs in the sequence database to build sequence profile. Although the traditional K-Nearest Neighbor algorithms are a good choice for this situation, one of the difficulties in utilizing these techniques is that all the labeled samples are given equal importance while deciding the secondary structure class of the protein residue and once a class has been assigned to a residue, there is no indication of its confidence in a particular class. In this paper, we propose a system based on the Fuzzy K-Nearest Neighbor Algorithm that addresses the above-mentioned issues and the system outperforms earlier K-Nearest neighbor methods that use multiple sequence alignments. We also introduce a new distance measure to calculate the distance between two protein sequences, a new method to assign membership values to the Nearest Neighbors in each of the Helix, Strand and Coil classes. We also propose a novel heuristic based filter to smoothen the prediction. Particularly attractive feature of our filter is that it does not require retraining when new structures are added to the database. We have achieved a sustained three-state overall accuracy of 75.75% with our system. The software is available upon request.
https://doi.org/10.1142/9781860947322_0010
Understanding how protein folds into a functional and structural configuration is arguably one of the most important and challenging problems in computational biology. Currently, the protein folding mechanism is often characterized by calculating the free energy landscape versus the reaction coordinates such as the fraction of native contacts, the radius of gyration, the principal components and so on. In this paper, we present a combinatorial algorithmic approach towards understanding the global state changes of the configurations. The approach is based on cluster computation, each cluster being defined by a pattern of a combination of various reaction coordinates. We present an algorithm of time complexity O((N + nm) log n) where N is the size of the output and n × m is the size of the input. To date, this is the best time complexity for the problem. We next demonstrate that this approach extracts crucial information about protein folding intermediate states and mechanism. (1) The method recovers states previously obtained by visually analyzing free energy contour maps. (2) It also succeeds in extracting meaningful patterns and structures that had been overlooked in previous works, which provide a better understanding of the folding mechanism (of a β-hairpin protein). These new patterns also interconnect various states in existing free energy contour maps versus different reaction coordinates. (3) The approach does not require the free energy values, yet it offers analysis comparable and sometimes better than the methods that use free energy landscapes, thus validating the choice of reaction coordinates.
https://doi.org/10.1142/9781860947322_0011
Transmembrane proteins affect vital cellular functions and diseases, and are a focus of drug design. It is difficult to obtain diffraction quality crystals to study transmembrane protein structure. Computational tools for transmembrane protein topology prediction fill in the gap between the abundance of transmembrane proteins and the scarcity of known membrane protein structures. Their prediction accuracy is still inadequate: TMHMM [2,7], the current state-of-the-art method, has less than 52% accuracy on the prediction of transmembrane proteins collected by Moller et al. [1, 4]. Based on the assumption that there are functional domains that occur preferentially internal or external to the membrane, we have extended the model of TMHMM, incorporating functional domain information into it, using an approach originally used in gene finding [8]. Results show that our Augmented HMM, or AHMM, is better than TMHMM on both helix and sidedness prediction. This improvement is verified by both statistical tests as well as sensitivity and specificity studies. As prediction of functional domain improves, our system’s prediction accuracy will likely improve as well.
https://doi.org/10.1142/9781860947322_0012
Subcellular localization performs an important role in genome analysis as a key functional characteristic of proteins. Therefore, an automatic, reliable and efficient prediction system for protein subcellular localization is needed for large-scale genome analysis. This paper describes a new residue-couple model using a support vector machine to predict the subcellular localization of proteins. This new approach provides better predictions than existing methods. The total prediction accuracies on Reinhardt and Hubbard’s dataset reach 92.0% for prokaryotic protein sequences and 86.9% for eukaryotic protein sequences with 5-fold cross validation. For a new dataset with 8304 proteins located in 8 subcellular locations, the total accuracy achieves 88.9%. The model shows robust against N-terminal errors in the sequences. A web server is developed based on the method which was used to predict some new proteins.
https://doi.org/10.1142/9781860947322_0013
Knowledge of targeting signals is of immense importance for understanding the cellular processes by which proteins are sorted and transported. This paper presents a system of recurrent neural networks which demonstrate an ability to detect residues belonging to specific targeting peptides with greater accuracy than current feed forward models. The system can subsequently be used for determining sub-cellular localisation of proteins and for understanding the factors underlying translocation. The work can be seen as building upon the currently popular series of predictors SignalP and TargetP, by exploiting the inherent bias for sequential pattern recognition exhibited by recurrent networks.
https://doi.org/10.1142/9781860947322_0014
Research on cleavage site prediction for signal peptides has focused mainly on the application of different classification algorithms to achieve improved prediction accuracies. This paper addresses the fundamental issue of amino acid encoding to present amino acid sequences in the most beneficial way for machine learning algorithms. A comparison of several standard encoding methods shows, that for cleavage site prediction the frequently used orthonormal encoding is inferior compared to other methods. The best results are achieved with a new encoding method named BLOMAP – based on the BLOSUM62 substitution matrix – using a Naïve Bayes classifier.
https://doi.org/10.1142/9781860947322_0015
This paper presents CIS, a biomedical simulation framework based on the markov random field (MRF). CIS is a discrete domain 2-D simulation framework emphasizing on the spatial interactions of biomedical entities. The probability model within the MRF framework facilitates the construction of more realistic models than deterministic differential equation approaches and cellular automata. The global phenomenon in CIS are dictated by the local conditional probabilities. In addition, multiscale MRF is potentially useful for the modelling of complex biomedical phenomenon in multiple spatial and time scales. The methodology and procedure of CIS for a biomedical simulation is presented using the scenario of tumor-induced hypoxia and angiogenesis as an example. The goal of this research is to unveil the complex appearances of biomedical phenomenon using mathematical models, thus enhancing our understanding on the secrets of life.
https://doi.org/10.1142/9781860947322_0016
Many human diseases are the result of abnormal interactions between multiple genes instead of single gene mutations. Discovering the interactions between these genes and their relationships to human diseases is critical for understanding mechanisms of diseases and helping design effective therapies. Valuable experimental evidence from years of industrious research by biologists can be used to help establish the underlying network of gene interactions related to human diseases. Fortunately, these information are habitually published in research journals whose abstracts are stored in a centralized, easily accessible public database called MEDLINE. To take advantage of this valuable resource, we have developed DiseasePathweaver—a computer-aided knowledge discovery system for extracting, summarizing and visualizing disease-specific gene interaction networks. Using DiseasePathweaver, a biologist can obtain a global overview of the gene interaction network related to a specific human disease, together with well-documented evidences linking to each gene and its putative interactions. We compared the gene networks of two complex human CNS diseases extracted by DiseasePathweaver to the corresponding networks from the human-curated KEGG database and found that our system can accurately cover 79% to 69% of the corresponding disease gene networks, showing the usefulness of DiseasePathweaver as a user-friendly knowledge discovery system for biologists to discover and understand gene interaction networks in complex human diseases. Free access to DiseasePathweaver is available for academic and non-profit users through http://pathweaver.i2r.a-star.edu.sg..
https://doi.org/10.1142/9781860947322_0017
Knowledge on molecular biological systems is increasing at an amazing pace. It is becoming harder to intuitively evaluate the significance of each interaction between molecules of the complex biological systems. Hence we need to develop an efficient mathematical method to explore the biological mechanism. In this paper, we employed hybrid functional Petri net to analyze the circadian genetic control mechanism, which is feedback loops of clock genes and generates endogenous near 24 hour rhythms in mammals. Based on the available biological data, we constructed a model and, by using Genomic Object Net, we performed computer simulations for time courses of clock gene transcription and translation. Although the original model successfully reproduced most of the circadian genetic control mechanism, two discrepancies remained despite wide selection of the parameters. We found that addition of an hypothetical path into the original model successfully simulated time courses and phase relations among clock genes. This also demonstrates usefulness of hybrid functional Petri net approach to biological systems.
https://doi.org/10.1142/9781860947322_0018
In proteomics, tandem mass spectrometry is the key technology for protein identification from the cells. However, partially due to the deficiency of peptide identification software, over half of the tandem mass spectra are discarded in almost all proteomics centers because they are not interpretable. The problem is more acute with the lower end data from low quality but cheaper devices such as the ion trap instruments. In order to deal with the noisy and low quality data, this paper develops a systematic approach to construct a robust linear scoring function, whose coefficients are determined by a linear program. A prototype, PRIMA, is implemented. When exhaustively tested with large benchmarks of varying qualities, PRIMA consistently outperforms the commonly used software MASCOT and SEQUEST with higher accuracy.
https://doi.org/10.1142/9781860947322_0019
We studied two cancer classification problems with mass spectrometry data and used SVM-RFE to select a small subset of peaks as input variables for the classification. Our study shows that, SVM-RFE can select a good small subset of peaks with which the classifier achieves high prediction accuracy and the performance is much better than with the feature subset selected by T-statistics. We also found that, the best peak subset selected by SVM-RFE always have in the top ranked peaks by T-statistics while it includes some peaks that are ranked low by T-statistics. However, these peaks together give much better classification performance than the same number of most top ranked peaks by T-statistics. Our experimental comparison of the performance of Support Vector Machine classification algorithm with and without peak selection also consolidates the importance of peak selection for cancer classification with mass spectrometry data. Selecting a small subset of peaks not only improves the efficiency of the classification algorithms, but also improves the cancer classification accuracy, even for classification algorithms like Support Vector Machines, which are capable of handling large number of input variables.
https://doi.org/10.1142/9781860947322_0020
Image registration technique is fundamental and essential to the accurate and efficient analysis of protein sequence data. But due to the elastic deformations of two-dimensional gel protein eletrophoresis images, their registration still remains a challenge. In this paper, a hybrid 2D gel protein image registration method is proposed. In the first stage of registration, the wavelet-based hierarchical registration which fully exploits the image intensity is used to correct global displacements and affine transformation parameters are achieved. In the second stage, the landmark-based elastic registration is introduced to correct local displacements and enhance registration performance and accuracy. In the proposed method, the hierarchical registration from low resolution to high resolution can accelerate the registration convergence and hence good computational efficiency can be achieved. Though wavelets have been widely used in image processing, its application in the area of gel protein image registration is not well investigated. In this paper, the wavelet-based hierarchical approach is introduced in 2D gel protein image registration. The algorithm makes use of the merits of the existing categories of registration techniques and achieves high performance registration results automatically.
https://doi.org/10.1142/9781860947322_0021
Cancer classification is one major application of microarray data analysis. Due to the ultra high dimensionality nature of microarray data, data dimension reduction has drawn special attention for such type of data analysis. The currently available data dimension reduction methods are either supervised, where data need to be labeled, or computational complex. In this paper, we proposed to use a revised locally linear embedding(LLE) method, which is purely unsupervised and fast as the feature extraction strategy for microarray data analysis. Three public available microarray datasets have been used to test the proposed method. The effectiveness of LLE is evaluated by the classification accuracy of a SVM classifier. Generally, the results are promising.
https://doi.org/10.1142/9781860947322_0022
Accurate cancer prediction is important for treatment of cancers. The combination of two dimension reduction methods, partial least squares (PLS) and singular value decomposition (SVD), with the penalized logistic regression (PLR) has created powerful classifiers for cancer prediction using microarray data, Comparing with support vector machine (SVM) on seven publicly available cancer datasets, the new algorithms can achieve very good performance and run much faster. They also have the advantage that the probabilities of predictions can be directly given. PLS based PLR is also combined with recursive feature elimination (RFE) to select a 16-gene subset for acute leukemia cancer classification. The testing error on this subset of genes is empirically zero.
https://doi.org/10.1142/9781860947322_0023
Microarray technology allows large-scale parallel measurements of the expression of many thousands genes and thus aiding in the development of efficient cancer diagnosis and classification platforms. In this paper, we apply the genetic algorithm and the silhouette statistic in conjunction with several distance functions to the problem of multi-class prediction. We examine two widely used sets of gene expression data, measured across sets of tumors, and present the results of classification accuracy on these two datasets by our methods. Our best success rate of tumor classification has better accuracy than many previously reported methods and it provides a useful method towards a complete tool in this domain.
https://doi.org/10.1142/9781860947322_0024
In this paper we study the problem of identifying meaningful patterns (i.e., motifs) from biological data. The general version of this problem is NP-hard. Numerous algorithms have been proposed in the literature to solve this problem. Many of these algorithms fall under the category of heuristics. We concentrate on exact algorithms in this paper. In particular, we concentrate on two different versions of the motif search problem and offer exact algorithms for them.
https://doi.org/10.1142/9781860947322_0025
The problem of identifying meaningful patterns (i.e., motifs) from biological data has been studied extensively due to its paramount importance. Three versions of this problem have been identified in the literature. One of these three problems is the planted (l, d)-motif problem. Several instances of this problem have been posed as a challenge. Numerous algorithms have been proposed in the literature that address this challenge. Many of these algorithms fall under the category of approximation algorithms. In this paper we present algorithms for the planted (l, d)-motif problem that always find the correct answer(s). Our algorithms are very simple and are based on ideas that are fundamentally different from the ones employed in the literature. We believe that the techniques we introduce in this paper will find independent applications.
https://doi.org/10.1142/9781860947322_0026
Pevzner and Sze [14] have introduced the Planted (l,d)-Motif Problem to find similar patterns (motifs) in sequences which represent the promoter region of co-regulated genes. l is the length of the motif and d is the maximum Hamming distance around the similar patterns. Many algorithms have been developed to solve this motif problem. However, these algorithms either have long running times or do not guarantee the motif can be found. In this paper, we introduce new algorithms to solve the motif problem. Our algorithms can find motifs in reasonable time for not only the challenging (9,2), (11,3), (15,5)-motif problems but for even longer motifs, say (20,7), (30,11) and (40,15), which have never been seriously attempted by other researchers because of heavy time and space requirements.
https://doi.org/10.1142/9781860947322_0027
In this paper we propose a new algorithm for identifying cis-regulatory modules in genomic sequences. In particular, the algorithm extracts structured motifs, defined as a collection of highly conserved regions with pre-specified sizes and spacings between them. This type of motifs is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The proposed algorithm uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the dataset sequences. The complexity analysis shows a time and space gain over previous algorithms that is exponential on the spacings between binding sites. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than two orders of magnitude. The application of the method to biological datasets shows its ability to extract relevant consensi.
https://doi.org/10.1142/9781860947322_0028
Histones constitute a rich protein family that is evolutionarily conserved across species. They play important roles in chromosomal functions in cell, such as chromosome condensation, recombination, replication, and transcription. We have modeled histone gene 5' end segments covering [-50,+500] relative to transcription start sites (TSSs). These segments contain parts of the coding regions in most of the genes that we studied. We determined characteristics of these segments for 116 mammalian (human, mouse, rat) histone genes based on distribution of DNA motifs obtained from MEME-MAST. We found that all five mammalian histone types (H1, H2A, H2B, H3, H4) have mutually distinct, prominent and strongly conserved properties downstream to the TSS reasonably well conserved across analyzed species. We then transformed the primary level motif data for each sequence into a higher order motif arrangement that involved only features such as presence of a motif, its position, its strand orientation, and mutual spacer length between motifs. We have built a Bayesian Network model based on these features and used the higher order motif arrangement data for its training and testing. When tested for classification between the five histone groups and using the leave-one-out cross-validation technique, the Bayesian model correctly classified 100% of histone H1 sequences, 100% of histone H2A sequences, 96.9% of histone H2B sequences, 94.4% of histone H3 sequences, and 95.8% of histone H4 sequences. Overall, the model correctly classified 97.4% of all histones sequences, Our Bayesian model has the advantage in having a small number of trainable parameters and it produces very few false positives. The model could be used to scan the genome for discovery of genes whose products are similar to histones.
https://doi.org/10.1142/9781860947322_0029
Human ribonuclease A (RNase A) superfamily consists of eight RNases with high sequence homology, in which RNase2 and RNase3 share 78% similarity. The evolutionary variation of RNases results in differential structure and function of the enzymes. To distinguish the characteristics of each RNase, we developed reinforced merging algorithms (RMA) to rapidly predict and identify the unique sequence motifs for each member of the highly conservative human RNaseA superfamily. Two unique regions in RNase3 were predicted and experimentally confirmed to contain epitopes for monoclonal antibodies (mAbs) specifically against RNase3. Our method provides a useful tool for identification of unique sequence motif for further experimental design.
https://doi.org/10.1142/9781860947322_0030
Identifying and assaying the relative abundance of members of complex microbial communities is an important problem in ecology. Sandberg et al.11 investigated the usage of genomic signatures to provide high identification percentages from short sequence samples. In this paper we present an improved naive Bayesian classification method using conditional probabilities, which can be used to classify unsequenced bacterial species, as well as identify and predict the frequency of the dominant species in mixed microbial populations.
https://doi.org/10.1142/9781860947322_0031
Viral infection poses a major problem for public health, horticulture and animal husbandry, possibly causing severe health crises and economic loss. Viral infections can be identified by the specific detection of viral sequences in two ways, the first is the amplification-based method, such as using the polymerase chain reaction (PCR), the reverse transcription-polymerase chain reaction (RT-PCR), or nested-PCR, for example, and the second is the hybridization-based approach, such as the use of southern blotting, northern blotting, dot blotting and DNA chips. The former provides the advantages of fast and specific detection and a lower detection limit, but also has some the following weakness; (1) the clinicians must assess which viruses are suspected in an infectious event; (2) the nucleotides on the nearest 3’-end of the designed primers are very important to the successful of the extension of the primer; (3) although multiplex PCR can be used to detect many viral sequences simultaneously, diagnosing the viral sequences of over 20 different species or strains in a single reaction is currently very difficult. The hybridization-based method can not only tolerate sequence variations of newly evolved virus strains, but can also simultaneously diagnose more viral sequences in a single reaction than can multiplex PCR. Many chips have so fat been designed for clinical use. Most are designed for special purpose, such as typing enterovirus infection, and compare fewer than 30 different viral sequences. None considers all primer design, increasing the likelihood of cross hybridization of similar sequences with other viral sequences. To prevent this possibility, this work establishes a platform and database that provides users with specific probes of all known viral genome sequences, to designing their diagnostic chips. This work develops a system for designing probes online. A user can select any number of different viruses and set their experimental conditions. Including, for example, melting temperature, length of probe. The system then return the optimal sequences to suspected viral infections to be automatically identified from database. The system that supports probe design for identifying virus has been published on our web page http://bioinfo.csie.ncu.edu.tw..
https://doi.org/10.1142/9781860947322_0032
A new peptide encoding scheme is proposed to use with support vector machines for the direct recognition of T cell epitopes. The method enables the presentation of information on both (1) amino acid positions in peptides and (2) the similarity between amino acids through the use of sparse indicator vectors and the BLOSUM50 matrix. A procedure of feature selection is also introduced. The computational results demonstrate superior performance over previous techniques.
https://doi.org/10.1142/9781860947322_0033
Evolution is an important sub-area of study in biological science, whereby the evolutionary history, or phylogeny, would shed light on the genetic linkage and the functional correlation for the species under consideration. Many kinds of species data can be deployed for the task and many phylogeny reconstruction methods have been examined in the literature. A quartet approach is to build a local phylogeny for every 4 species, which is called a quartet for these 4 species, and then to assemble a phylogeny for the whole set of species satisfying the topological constraints imposed by these quartets built. In practice, those predicted quartets might not agree each other and the optimization problem, the well-known Maximum Quartet Consistency (MQC) problem, is to construct a phylogeny to satisfy a maximum number of the predicted quartets. An equivalent representation for the MQC problem through searching for a certain ultrametric matrix via Answer Set Programming has recently been proposed. This paper follows the approach and presents a number of optimization techniques to speed up the searching process. The experimental results on both the simulated and real datasets suggest that the new representation combined with Constraint Programming presents a unique perspective to the MQC problem.
https://doi.org/10.1142/9781860947322_0034
To construct a phylogenetic tree or phylogenetic network for describing the evolutionary history of a set of species is a well-studied problem in computational biology. One previously proposed method to infer a phylogenetic tree/network for a large set of species is by merging a collection of known smaller phylogenetic trees on overlapping sets of species so that no (or as little as possible) branching information is lost. However, little work has been done so far on inferring a phylogenetic tree/network from a specified set of trees when in addition, certain evolutionary relationships among the species are known to be highly unlikely. In this paper, we consider the problem of constructing a phylogenetic tree/network which is consistent with all of the rooted triplets in a given set and none of the rooted triplets in another given set
. Although NP-hard in the general case, we provide some efficient exact and approximation algorithms for a number of biologically meaningful variants of the problem.
https://doi.org/10.1142/9781860947322_0035
Genome rearrangement is an important area in computational biology. There are three basic operations, reversal, translocation, and transposition. Here we study the translocation operations. Multi-chromosomal genomes frequently evolve by translocation events that exchange genetic material between two chromosomes. We focus on the signed case, where the direction of each gene is known. The signed translocation problem asks to find the minimum number of translocation operations as well as the sequence of translocation operations to transform one genome into the other. A linear-time algorithm that computes the the minimum number of translocation operations was given in Li et al., 2004.14 However, that algorithm cannot give the optimum sequence of translocation operations. The best known algorithm that can give the optimum sequence of translocation operations for signed translocation problem runs in O(n2 log n) time. In this paper, we design an O(n2) algorithm.
https://doi.org/10.1142/9781860947322_0036
Information of the structures and functions of protein molecules and their mutual interactions that construct protein networks increases rapidly as the consequence of the structural genomics and structural proteomics projects [1]. Advanced applications of such information require the Grid technology to solve the two problems: (i) the shortage of computational power, and (ii) the lack of a capability for seamlessly and quickly retrieving data from the varieties of heterogeneous biological databases [2].
https://doi.org/10.1142/9781860947322_0037
The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics. Existing methods that assign orthologs based on the similarity between DNA or protein sequences may make erroneous assignments when sequence similarity does not clearly delineate the evolutionary relationship among genes of the same families. In this paper, we present a new approach to ortholog assignment that takes into account both sequence similarity and evolutionary events at genome level, where orthologous genes are assumed to correspond to each other in the most parsimonious evolving scenario under genome rearrangement. It is then formulated as a problem of computing the signed reversal distance with duplicates between two genomes of interest, for which an efficient heuristic algorithm was given by introducing two new optimization problems, minimum common partition and maximum cycle decomposition. Following this approach, we have implemented a high-throughput system for assigning orthologs on a genome scale, called SOAR, and tested it on both simulated data and real genome sequence data. Compared to a recent ortholog assignment method based entirely on homology search (called INPARANOID), SOAR shows a marginally better performance in terms of sensitivity on the real data set because it was able to identify several correct orthologous pairs that were missed by INPARANOID. The simulation results demonstrate that SOAR in general performs better than the iterated exemplar algorithm in terms of computing the reversal distance and assigning correct orthologs.
https://doi.org/10.1142/9781860947322_0038
Different species may have developed specific modes of transcription initiation of their genes. We have considered two relatively distant vertebrate species, Fugu rubripes and human, and compared the content of their extended core promoters covering the range of [−70,+60] relative to experimental transcription start sites. Our study included over 18,000 human genes and 5,000 Fugu genes. We utilized TRANSFAC Professional database Ver.7.4 and all its available matrix models related to vertebrates, as well as the associated Match program ver.1.9. Promoter groups with different GC-content were analyzed in order to reduce potential nucleotide bias. We observed striking differences in the core promoter contents of these two species. We enumerated the most significant differences in transcriptional patterns of Fugu and human. We also determined species specific sets of transcriptional patterns that could be linked to transcription initiation, as well as the most frequent and, separately, the most over-represented patterns in various promoter groups. Our study should contribute to better understanding of the transcription initiation mechanisms in these two species, insights into the impact of 450 million years of evolution on transcription initiation, and better annotation of promoter transcriptional regulatory patterns in different species.
https://doi.org/10.1142/9781860947322_0039
The universe of biological data emanating from a variety of technologies continues to expand at almost alarming rates. Novel technologies addressing views of the complex biological systems in near molecular resolution have emerged with breakthroughs in the ultra-high throughput DNA sequencing, transcript (mRNA) profiling as whole genome biosensors, and protein profiling strategies based in the core technologies of analytical chemistry and mass spectrometry. We are now in a world where scientists can examine a biological system at a high level of resolution that exceeds their capacity to understand the results in supporting or refuting the hypothesis. The large volumes of data generated through these technologies have necessitated the need to be able to manage, analyze, synthesize, distribute and generate coherent knowledge. This presentation will review the challenges and opportunities in Bioinformatics, including issues with functional annotation of proteins, Systems Biology approaches to mine databases, Biomarker discovery, etc. Some of these challenges and ongoing efforts at addressing these in Singapore will be discussed.
https://doi.org/10.1142/9781860947322_bmatter
AUTHOR INDEX