This unique volume presents major developments and trends in bioinformatics and its applications. Comprising high-quality scientific research papers and state-of-the-art survey articles, the book is divided into five main sections: Microarray Analysis and Regulatory Networks; Machine Learning and Statistical Analysis; Biomolecular Sequence and Structure Analysis; Symmetry in Sequences; and Signal Processing, Image Processing and Visualization. The results of these investigations help the practicing biologist in many ways: identifying unknown connections, narrowing down possibilities for a search, suggesting new hypotheses, designing new experiments, and validating existing models or proposing new ones. Based on selected papers from the International Conference on Bioinformatics and Its Applications (ICBA), held December 16–19, 2004 in Fort Lauderdale, Florida, USA, it is an essential source of reference for researchers and graduate students in bioinformatics, computer science, mathematics, statistics, and the biological sciences.
https://doi.org/10.1142/9789812702098_fmatter
FOREWORD.
CONTENTS.
https://doi.org/10.1142/9789812702098_0001
Oligonucleotide gene expression microarray platforms typically measure gene expression as a magnitude proportional to transcript abundance. Studies attempting to find gene expression patterns in microarray data must use normalized data. Normalizing microarray gene expression data across multiple chips corrects for differing solution levels and other processing variations. One of the most common approaches to normalization is a linear scaling of expression values to a constant mean intensity. An alternative strategy that has become increasingly popular is the model-based approach, in which an expression value is normalized while taking into account the expression values on all chips. This process can lead to more accurate representations of gene expression; however, the normalization technique is computationally demanding. We propose an alternative approach in which a microarray chip is self-normalized relative to the total quantity of RNA (as represented by total signal intensity) on the microarray. This process can correct for problems such as increased concentration levels while still preserving specificity. We experiment on a well-known benchmark dataset (the U133A spike-in study from Affymetrix) and compare our proposed normalization method to the Affymetrix MAS 5.0 technique. We demonstrate that our proposed normalization method reduces the variability in technical replicate results and improves the detection of spike-in values.
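The abstract does not give the scaling rule explicitly; a minimal sketch of total-intensity self-normalization of this general kind, with an arbitrary `target_total` reference value (illustrative only, not the paper's actual procedure), might look like:

```python
import numpy as np

def self_normalize(intensities, target_total=1e6):
    """Scale one chip so its summed probe intensity equals a fixed target.

    A sketch of total-intensity self-normalization: each array is rescaled
    by a single factor so that total signal (a proxy for total RNA) is
    constant across chips.  `target_total` is an arbitrary reference value.
    """
    intensities = np.asarray(intensities, dtype=float)
    return intensities * (target_total / intensities.sum())

# Example: two replicate chips with different overall brightness
chip_a = np.array([120.0, 340.0, 90.0, 450.0])
chip_b = chip_a * 1.8                       # same RNA, higher overall signal
print(self_normalize(chip_a))
print(self_normalize(chip_b))               # identical after normalization
```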
https://doi.org/10.1142/9789812702098_0002
In this paper, we propose a new subspace clustering algorithm, EPSCMIX (Emerging Pattern Subspace Clustering by MIXture models), for clustering microarray expression data. Our method is based on a feature saliency measure, defined as the probability of feature relevance and estimated by an Expectation Maximization (EM) algorithm. Because this approach employs Emerging Patterns (EPs) to obtain feature saliency, EPSCMIX can not only model interactions of genes, but also estimate the best number of classes using a meaningful set of genes related to each cluster. The robustness of our new approach is demonstrated on a widely used dataset of colon tissues. Experimental results show the proposed method gives better classification accuracy than emerging pattern based projected clustering (EPPC) or single EPs.
https://doi.org/10.1142/9789812702098_0003
We present a new linear programming algorithm that combines mass conservation and the law of entropy in large-scale metabolic networks to compute thermodynamically feasible reaction fluxes and their corresponding changes in chemical potential. To detect thermodynamic infeasibility of a flux vector, we formulate a sign test that can be applied in time polynomial in the length of the flux vector. We illustrate our algorithm by applying it to the central metabolism of E. coli.
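As a rough illustration of the mass-conservation constraint such an algorithm builds on (not the authors' formulation, and omitting the entropy-based sign test), a steady-state flux LP over a toy network can be posed as follows:

```python
import numpy as np
from scipy.optimize import linprog

# Toy stoichiometric matrix S (metabolites x reactions); S @ v = 0 enforces
# mass conservation at steady state.  Reactions and bounds are invented for
# illustration only.
S = np.array([
    [ 1, -1,  0],   # metabolite A: produced by r0, consumed by r1
    [ 0,  1, -1],   # metabolite B: produced by r1, consumed by r2
])
c = [0, 0, -1]                      # maximize flux through r2 (linprog minimizes)
bounds = [(0, 10)] * 3              # irreversible reactions, capacity 10

res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds)
print("flux vector v:", res.x)      # e.g. [10. 10. 10.]
```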
https://doi.org/10.1142/9789812702098_0004
The carbon fixation pathway plays an essential role in primary production and the natural carbon recycling process. In this paper, we present a computational model of the carbon fixation pathway in the marine cyanobacterium Synechococcus sp. WH 8102 (SYNWH8102), using our recently developed computational protocol for the inference of regulatory and signaling pathways. The results of our pathway prediction include: (a) the pathway model consists of the major components of the carbon fixation pathway reported in the literature; (b) eight additional gene candidates were added to the model, making our pathway model more complete; (c) two predicted transcription factor binding sites were found in the upstream regions of multiple genes involved in the carbon fixation pathway; and (d) a list of more than 200 potential candidates associated with the pathway was also obtained for further experimental validation.
https://doi.org/10.1142/9789812702098_0005
Rapid progress in genome research has produced a huge amount of nucleotide sequence data, including hundreds of complete genomes (Entrez Genomes). There is therefore a need to automate the gene prediction process to ensure that the amount of information does not in itself become a problem. Many prediction engines are available to facilitate such identification, but their application scope and prediction capability vary. This paper investigates the potential to improve gene prediction by integrating three available engines, GrailEXP, Genscan, and MZEF, by means of a modular mixture of experts (MoE) neural network, where the modular architecture directly supports the partitioned nature of the original data sources. The three selected engines represent three different categories of prediction software. We were able to demonstrate that the integrated system has markedly better recovery (proportion of actual exons that are predicted) than any of the individual prediction engines alone. After integration, we achieved a higher correlation coefficient of exon prediction and thus a higher accuracy of results. The program is available online at http://www.cbr.nrc.ca/pany/integ.html with links to the data used for this research.
https://doi.org/10.1142/9789812702098_0006
Microarray-based expression profiling is widely used in the study of the molecular mechanisms underlying various physiological and pathological processes. Understanding microarray data depends largely on the identities of the probes used to represent genes or transcripts. Unfortunately, different databases, and even different versions of the same database, may assign different gene identities to a significant percentage of probes on popular microarray platforms. We compared gene definitions across major public domain clustering and genome annotation databases to reveal the extent of this problem. A web-based function for examining cross-database probe identity consistency was also developed to help microarray users deal with this issue.
https://doi.org/10.1142/9789812702098_0007
Gene Set Enrichment Analysis (GSEA) was designed to detect subtle but coordinated differences in expression of a priori defined sets of functionally related genes [1]. While GSEA can provide significant insights for the molecular mechanisms underlying many pathophysiological processes, there are concerns about its bias toward large gene sets. Computation time associated with some probe level analysis algorithms such as fitPLM [2] is also a major concern. We carried out systematic analyses on important issues related to GSEA and optimized the GSEA procedure for web implementation.
https://doi.org/10.1142/9789812702098_0008
In this paper, we present a new approach that will facilitate the process of drawing meaningful conclusions that are likely to be useful to a biologist who may be interested in only a specific functional aspect of the organism (rather than the entire genome). This is achieved by constructing a comprehensive relational database and an intelligent, user-friendly query system. With the help of simple examples, we show how a biologist with no prior knowledge of databases can use this system. Our database can be queried from the URL: http://biorg.cs.fiu.edu/TFBM/.
https://doi.org/10.1142/9789812702098_0009
Disease monitoring plays a crucial role in the implementation of public health measures. The demographic profiles of the people and the disease prevalence in a geographic region are analyzed for inter-causal relationships. Bayesian analysis of the data identifies the pertinent characteristics of the disease under study. The vital components of control and prevention of disease spread are identified by Bayesian learning for the efficient utilization of limited public health resources. Bayesian computing, layered with epidemiological expertise, enables public health personnel to utilize their available resources optimally to minimize the prevalence of the disease. Bayesian analysis is implemented using synthetic data for two different demographic and geographic scenarios for pneumonia and influenza, which exhibit similar symptoms. The analysis infers results on the effects of the demographic parameters, namely ethnicity, gender, age, and income levels, on the evidence of the prevalence of the diseases. Bayesian learning brings in the probabilistic reasoning capabilities needed to port the inferences derived from one region to another.
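For readers unfamiliar with the mechanics, a toy Bayes update of the kind such an analysis rests on (all numbers invented for illustration, not from the paper's synthetic data) looks like:

```python
# Toy Bayes update: how a single demographic attribute shifts belief between
# two diseases with similar symptoms.  All numbers are invented.
prior = {"pneumonia": 0.02, "influenza": 0.05}          # baseline prevalence
p_attr_given = {"pneumonia": 0.60, "influenza": 0.25}   # P(elderly | disease)

evidence = sum(prior[d] * p_attr_given[d] for d in prior)
posterior = {d: prior[d] * p_attr_given[d] / evidence for d in prior}
print(posterior)   # P(disease | elderly), renormalized over the two diseases
```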
https://doi.org/10.1142/9789812702098_0010
A stochastic model describing planktonic interaction is presented. Environmental fluctuations are incorporated by perturbing the growth rate of phytoplankton and the death rate of zooplankton with coloured noise. Spectral density functions are studied, and a statistical linearization technique is used to show the instability and periodicity of the model system, indicating the cyclic nature of blooms.
https://doi.org/10.1142/9789812702098_0011
Profile HMMs based on classical hidden Markov models have been widely studied for the identification of protein sequence families. The formulation of the forward and backward variables in profile HMMs is made under the statistical independence assumption of probability theory. This assumption has been identified as a limitation affecting the performance of classical HMMs. To overcome this limitation and improve the performance of profile HMMs, we propose a fuzzy profile HMM that incorporates Sugeno fuzzy measures using the Choquet integral, based on the generalized HMM in [5]. The proposed model was built and tested on widely studied globin family sequences, and its performance was compared with the classical HMM. Results in terms of LL scores and other performance metrics demonstrate the superiority of the fuzzy profile HMM over the classical HMM in its ability to distinguish member and non-member protein sequences.
https://doi.org/10.1142/9789812702098_0012
The popular Affymetrix human gene expression analysis arrays contain oligonucleotide probes that overlap with thousands of known single nucleotide polymorphism sites. These allele-specific probes offer the possibility of screening for allele-frequency differences in existing GeneChip data. We explored analysis procedures that can detect allele-specific signals of known SNPs in different sample groups. Our analysis results and independent genotyping data show that it is feasible to utilize these allele-specific probes to screen for potential allele-frequency disequilibrium in large-scale GeneChip-based expression profiling projects.
https://doi.org/10.1142/9789812702098_0013
In this paper, we introduce a Self-Organizing Map (SOM) algorithm that can extract features of DNA sequences. DNA sequences are considered to have special features depending on the regions from which they are taken or on the gene functions of the proteins translated from them. If the hidden features of DNA sequences can be extracted, they can be used to predict the regions or functions of unknown sequences. We have developed algorithms that organize probes using a SOM. The algorithm can select a small number of probes from all combinations of nucleotides of a given length. The probes are organized on two-dimensional maps according to their similarities. In this paper we propose to use these maps for sequence analyses, and we propose a batch SOM algorithm that updates the map using simulated annealing so that adjacent probes are organized closely on the map. We analyzed DNA sequences with respect to the functions of the translated proteins and the species of the sequences, and the results are shown in this paper.
https://doi.org/10.1142/9789812702098_0014
Clustering data into natural groupings has important applications in fields such as Bioinformatics. Support Vector Clustering (SVC) does not require prior knowledge of a dataset and it can identify irregularly shaped cluster boundaries. A major SVC challenge is the choice of an important parameter value, the width of a kernel function that determines a nonlinear transformation of the input data. Since evaluating the result of a clustering algorithm is a highly subjective process, a collection of different parameter values must typically be examined. However, no algorithm has been proposed to specify the parameter values. This paper presents a secant-like numerical algorithm that generates an increasing sequence of SVC kernel width values. An estimate of sequence length depends on spatial characteristics of the data but not the number of data points or the data's dimensionality. The algorithm relies on a function that relates the kernel width value to the radius of the minimal sphere enclosing the images of data points in a high-dimensional feature space. Experimental results with 2D and higher-dimensional datasets suggest that the algorithm yields useful data clusterings.
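The paper's exact iteration is not reproduced in the abstract; a generic secant update of the kind described, driving an assumed width-to-radius map toward a target value, can be sketched as follows (the stand-in function and target are invented for illustration):

```python
import math

def secant_sequence(f, q0, q1, target, n_steps=10, tol=1e-6):
    """Secant-like iteration producing a sequence of kernel-width values.

    `f(q)` stands in for the (hypothetical) map from Gaussian kernel width
    to the radius of the minimal enclosing sphere in feature space; the
    iteration seeks q with f(q) = target.  Purely illustrative.
    """
    widths = [q0, q1]
    for _ in range(n_steps):
        f0 = f(widths[-2]) - target
        f1 = f(widths[-1]) - target
        if abs(f1 - f0) < tol:
            break                     # flat secant; stop
        q_next = widths[-1] - f1 * (widths[-1] - widths[-2]) / (f1 - f0)
        widths.append(q_next)
        if abs(f(q_next) - target) < tol:
            break                     # converged
    return widths

# Toy stand-in for the width -> radius relation
print(secant_sequence(lambda q: 1.0 - math.exp(-q), 0.1, 1.0, target=0.5))
```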
https://doi.org/10.1142/9789812702098_0015
The ability of physicians to effectively treat and cure cancer is directly dependent on their ability to detect cancers at their early stages, and early detection has the potential to dramatically reduce mortality. Recently, the use of mass spectrometry to develop profiles of patient serum proteins has been reported as a promising method to achieve this goal. In this paper, we analyzed ovarian cancer and prostate cancer data sets using support vector machines (SVMs) to detect cancer at early stages based on serum proteomic patterns. The results showed that SVMs, in general, performed well on these two data sets, as measured by sensitivity, specificity, positive predictive value, negative predictive value, and accuracy. The linear kernel worked best on the ovarian cancer data, with a sensitivity of 0.99 and an accuracy of 0.97, while the polynomial kernel worked best on the prostate cancer data, with a sensitivity of 0.79 and an accuracy of 0.82. When the radial kernel was applied to either of the two data sets, all samples were predicted as cancer samples, with a sensitivity of 1 and a specificity of 0. Furthermore, feature selection did not improve SVM performance.
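A sketch of this style of kernel comparison, using scikit-learn on synthetic stand-in data rather than the actual serum proteomic sets (illustrative only):

```python
from sklearn import svm
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import recall_score, accuracy_score
from sklearn.datasets import make_classification

# Synthetic stand-in for a serum proteomic dataset (the real ovarian and
# prostate data are not reproduced here).
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = svm.SVC(kernel=kernel)
    pred = cross_val_predict(clf, X, y, cv=5)          # 5-fold predictions
    print(kernel,
          "sensitivity=%.2f" % recall_score(y, pred),
          "accuracy=%.2f" % accuracy_score(y, pred))
```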
https://doi.org/10.1142/9789812702098_0016
Evidence from investigations of genetic differences among humans shows that genetic diseases are sometimes the result of genetic mutations that occur in more than one percent of a population (variations). The most common of these variations are single nucleotide polymorphisms (SNPs). A complete map of all SNPs in the human genome will be extremely valuable for associating specific haplotypes with specific genetic diseases. Distinguishing the information contained in both haplotypes when analyzing chromosome sequences poses several new computational issues, which collectively form a new emerging topic of Computational Biology known as Haplotyping. The recent discovery [6,3,8,4] that genomic DNA can be partitioned into long blocks where genetic recombination has been rare has led to more meaningful mathematical research on the associated computational problems. We are interested in the algorithmic implications of observing long blocks of DNA that have not undergone recombination in order to infer haplotypes in populations. This assumption justifies a model of haplotype evolution: the haplotypes in a population are assumed to evolve along a coalescent, based on the standard population-genetic assumption of infinite sites, which as a rooted tree is a perfect phylogeny. The Perfect Phylogeny Haplotyping (PPH) problem, introduced by Gusfield [5], asks: given n genotypes, does there exist a set of at most 2n haplotypes such that each genotype is generated by a pair of haplotypes from this set, and such that this set can be derived on a perfect phylogeny? Gusfield established an O(nm²)-time algorithm [2] that determines whether there is a PPH solution for input genotypes and produces a linear-space data structure to represent all the solutions. He also conjectured that it is possible to solve the PPH problem in linear time [2]. In this paper, we settle Gusfield's conjecture and introduce a linear-time algorithm for the PPH problem. We prove that a haplotype matrix that has a perfect phylogeny induces isomorphic posets whether rows or columns are taken as the vertex set. Moreover, a genotype poset induced from a genotype matrix is a superposet of the haplotype poset induced from the feasible expansion of the genotype poset. After studying the Hasse diagrams of genotype posets, we develop a new graph model and design a linear-time (O(nm)) algorithm to solve the Perfect Phylogeny Haplotyping problem. We believe our model can be further applied to solve other PPH-related problems, such as the Pure Parsimony problem.
https://doi.org/10.1142/9789812702098_0017
In silico biology is a rapidly growing discipline in the post-genome and post-"Google" era, which takes multi-leveled computational approaches that combine genome information with a myriad of biological phenotypes. In particular, there has been surging interest in tackling computational challenges in the use of single nucleotide polymorphisms (SNPs) for haplotype reconstruction and disease gene mapping, and in exploratory analyses of high-dimensional, genome-wide expression data. Three central themes of in silico biology are covered. (i) Haplotype inference: Clark's algorithm, the standard Expectation-Maximization (EM) algorithm, and Bayesian algorithms are briefly reviewed; in particular, the idea of a novel partition-ligation (PL) approach implemented in both Bayesian and EM frameworks is illustrated. Moreover, a new genotype clustering algorithm, GeneScore, based on a mixture model of t-distributions, is proposed for probabilistic genotype scoring, and a new haplotype phasing algorithm, GS-EM, which takes "GenoSpectrum" inputs, is introduced. (ii) Linkage disequilibrium (LD) analysis: permutation tests, likelihood ratio tests, and machine learning approaches are articulated through practical examples. (iii) Microarray-based gene selection techniques: statistical procedures used to select high-priority genes based on microarray data are summarized; specifically, a new data-based transformation method and the use of a new concept of least significant false discovery rate (LS-FDR) in guiding gene selection are illustrated. Finally, challenges and future directions of in silico biology are presented in the concluding remarks.
https://doi.org/10.1142/9789812702098_0018
Microarray technology has paved the way to high-dimensional data analysis for molecular classification, with the large number of features (genes) making the need for feature selection techniques more crucial than ever. From various ranking-based filter procedures to classifier-based wrapper techniques, many studies have devised their own flavor of feature selection method. However, in the context of highly multiclass microarray data, only a handful have delved into the effect of redundancy in the feature set on classification accuracy. We present a generic, fast feature selection method for datasets with a large number of classes. Our feature selection method is capable of achieving low-biased accuracies comparable to those of previous studies while at the same time providing insight into the effect of relationships between genes on classification accuracy.
https://doi.org/10.1142/9789812702098_0019
Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologs retain the same function in the course of evolution. It is hence important to identify orthologs for transferring functional and other information between genes in different organisms with a high degree of reliability. For example, protein-protein interactions identified by high-throughput proteomics already cover three-quarters of yeast proteins to date. Similar information is being massively generated for other important model systems such as fly and worm. Mapping their orthologous counterparts in human will thus tremendously assist in understanding the biological functions of human genes at the systems level. Unfortunately, a confounding factor in this process is that cross-species comparisons often identify genes that, although highly similar, do not represent true orthologs and may in fact be functionally dissimilar because of the large numbers of paralogs within protein families. Existing clustering methods based on two-way best genome-wide similarities have so far not separated paralogs from orthologs effectively. We present a fully automatic computational method to cluster orthologs and in-paralogs from multiple species. We use the program BLASTP to generate a pairwise distance matrix, which is then normalized for each homologous group between and within the species included. We also use protein domains and their organization in protein sequences as an additional criterion for filtering false relationships. Ortholog clusters are first seeded with multiple reciprocal best pairwise matches, after which the Markov graph-flow algorithm is applied to include in-paralogs. Classification parameters such as the inflation index are optimized according to the functional consistency within each of the clusters, inferred by comparing the ontological annotations available for the sequences belonging to the same cluster. We also use existing structural classifications of proteins to validate our results. We apply our programs to six completely sequenced eukaryotic genomes and assign confidence values for both orthologs and in-paralogs. We note significant improvement in the clustering of orthologs with recent paralogs when comparing our results with similar efforts at NCBI and TIGR. This provides an automatic and robust method to cluster orthologous proteins of multiple genomes.
https://doi.org/10.1142/9789812702098_0020
Complete genome sequences of several pathogenic bacteria have been determined, and many more such projects are currently under way. While these data potentially contain all the determinants of host-pathogen interactions and possible drug targets, computational tools for selecting suitable candidates for further experimental analyses are currently limited. Detection of bacterial genes that are non-homologous to human genes, are essential for the survival of the pathogen and are conserved across species represent a promising means of identifying novel drug targets. We have used four-way genome comparisons to identify essential genes from Salmonella species. Our approach identified 297 essential genes that may be considered as potential drug targets. It was interesting to see that many ORFs belong to the group of genes involved in protein biosynthesis, transcription and DNA replication. The resultant analyses are in good agreement with the results of experimental data. This approach enables rapid potential drug target identification, thereby greatly facilitating the search for new antibiotics. These results underscore the utility of large genomic databases for in silico systematic drug target identification in the post-genomic era.
https://doi.org/10.1142/9789812702098_0021
It is extremely challenging to design a machine learning algorithm that generates tolerable error rates for class prediction across various types of genetic data. A particular system may be very effective for one microarray dataset but fail to perform similarly on another. This paper introduces a neural approach based on the Generalized Regression Neural Network (GRNN), the Collimator Neural Network (CNN), which provides consistent performance stability for various types of microarray data. CNN performance has been cross-validated with k-fold (leave-one-out) cross validation for BRCA1, BRCA2 and sporadic mutation classification on ovarian and breast cancer data. The paper presents comparative classification results for binary and multiclass prediction by CNN against previously presented work and other well-established classifiers, including Support Vector Machines (SVM), the Probabilistic Neural Network (PNN), a single GRNN and the novel Concurrent PNN (CPNN). Simulation results confirm that CNN performed significantly better than SVM, PNN, GRNN and CPNN on both the breast cancer and ovarian cancer datasets.
https://doi.org/10.1142/9789812702098_0022
Supervised pattern discovery techniques have been successfully used for motif detection. However, this requires an effective training set. Even in cases where many examples are known, using all the available examples can introduce bias during the training process. In practice, training sets are designed with the help of domain experts; whenever such expertise is not available, they are usually picked at random. We present a new strategy for designing good training sets that uses phylogenetic trees to automatically reduce the bias in training sets. When applied to helix-turn-helix motif detection, we show that the technique improved error rates over a random choice of training set. More specifically, false positive rates show a marked improvement.
https://doi.org/10.1142/9789812702098_0023
This paper presents a gene selection approach for multiclass prediction, with application to gene expression profiling. A two-step gene selection method is proposed to obtain gene subsets with improved diagnostic power according to their joint discriminatory power for class separability. First, individually discriminatory genes (IDGs) are identified using a one-dimensional weighted Fisher criterion (1D-wFC). Second, jointly discriminatory genes (JDGs) are selected by sequential search methods, according to their joint class separability measured by the multidimensional wFC. By applying this approach to a microarray study of small round blue cell tumors (SRBCTs), we demonstrate that our two-step gene selection method can successfully identify a subset of genes with superior classification performance for multiclass prediction.
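As a sketch of the first (IDG) step, a plain one-dimensional Fisher criterion can rank genes as below; the paper's specific weighting (wFC) is not reproduced here:

```python
import numpy as np

def fisher_scores(X, y):
    """One-dimensional Fisher criterion for each gene (column of X).

    Ranks genes by between-class versus within-class scatter.  A generic
    stand-in for the paper's 1D-wFC, whose weights are not given here.
    """
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)

# Random stand-in data: 30 samples, 100 genes, 3 classes
X = np.random.default_rng(0).normal(size=(30, 100))
y = np.repeat([0, 1, 2], 10)
top_idgs = np.argsort(fisher_scores(X, y))[::-1][:10]   # ten best single genes
print(top_idgs)
```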
https://doi.org/10.1142/9789812702098_0024
In our attempt to understand functional genomics, we must be able to integrate information regarding the multi-functional nature of proteins. We propose a two-stage algorithm that uses both supervised and unsupervised learning techniques. The first stage partitions the feature space using a tree structure that we refer to as a Sequential Bifurcation Tree. The second stage then uses the tree as a supervised classifier, which we refer to as a multiply-labeled instance classifier. We investigate the effectiveness of this classifier in learning protein functional classes, a challenging problem made more difficult by the fact that proteins can have multiple functions and hence multiple functional class labels. We demonstrate the effectiveness of the Sequential Bifurcation algorithm in handling such data and learning protein functional classes. The Sequential Bifurcation approach compares favorably to more traditional machine learning approaches such as support vector machines and decision trees.
https://doi.org/10.1142/9789812702098_0025
This paper discusses identification of contaminants in MALDI-TOF mass spectrometry data derived from proteomic studies. Contaminant masses are usually submitted with valid peptide masses to protein identification search engines and can potentially lead to false positive results. The paper presents a method for automatic identification of contaminant masses so that they can be removed prior to the submission of the peak list for protein identification. The developed algorithm clustered mass values from over 3000 proteins and found 50 potential contaminant masses. Analysis of the data and comparison with results of others revealed that 29 of the 50 are true known contaminants, which validated the proposed method.
https://doi.org/10.1142/9789812702098_0026
By applying mathematical analysis to flow cytometry data, we revealed that, as expected, treatment with EPO increased the fraction of erythroid precursors in murine bone marrow. However, we also found the fraction of apoptotic and dead late stage erythroid precursors increased. Furthermore, we also demonstrated that EPO increased the ratio of dead polyploid to dead diploid erythroid cells, providing evidence that EPO increased the probability that dividing erythroid precursors will enter apoptosis, a finding not evident by expert human analysis. These results indicate that mathematical analysis can be a powerful approach for analyzing flow cytometry data and understanding complex biological systems.
https://doi.org/10.1142/9789812702098_0027
Information about local protein sequence motifs is very important to the analysis of biologically significant regions of protein sequences. In this work, recurring patterns of short protein segments are explored with the K-means clustering algorithm. A new greedy initialization algorithm is proposed, for the first time, to improve the performance of the K-means clustering technique. The percentage of sequence segments belonging to clusters with structural homology greater than 60% increases by 3.02% for the improved K-means algorithm compared to the traditional K-means algorithm, confirming that the new algorithm works. This research uses a much bigger and more advanced database than previous studies; the advanced database and improved K-means algorithm may provide a good opportunity to reveal subtle recurring sequence patterns undetectable in previous work. Experimental results indicate that some very meaningful sequence motifs representing common secondary and tertiary structures have been found in this work. These motifs are universally conserved sequence patterns across protein families.
https://doi.org/10.1142/9789812702098_0028
There is a strong need for computational methods to predict and characterize functional binding sites for the functional annotation of protein structures. In this paper, we propose a new method that uses descriptions of sites based on local properties, together with support vector machines, to predict and obtain key features of metal-binding sites. Previous approaches relied on conserved residues or conserved residue distributions in the protein three-dimensional structure for prediction. All reported Ca2+, Zn2+, Mg2+, Mn2+ and Cu2+ binding sites present in a non-redundant PDB (a total of 1144 metal-binding sites in 467 proteins) were used in our experiments. Results from ten-fold cross-validation showed sensitivity and specificity above 95%. Then, using feature selection methods, profiles of critical features were obtained for each metal-binding site. These profiles are consistent with prior knowledge about metal-binding sites. Furthermore, they provide new insights into the microenvironments of the metal-binding sites.
https://doi.org/10.1142/9789812702098_0029
A MATLAB toolbox for macromolecular modeling has been developed. The toolbox consists of routines for extracting and updating PDB data files, for calculating the DME (Distance Matrix Error) and RMSD (Root-Mean-Square Deviation) of given structures, for building structural models from known inter-atomic distances, etc. Two algorithms, a singular-value decomposition algorithm and a geometric build-up algorithm, are used for distance-based structure modeling. A simulated annealing algorithm is implemented for energy minimization in structure refinement or determination. Functions facilitating structural conversion among three coordinate systems, defined in terms of Cartesian coordinates, inter-atomic distances, and dihedral angles, are also built into the toolbox. With these well-defined routines, the toolbox can be used to develop algorithms in structural bioinformatics within the MATLAB working environment.
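For illustration, an RMSD routine of the kind the toolbox provides can be realized with the standard Kabsch/SVD superposition; the sketch below is in Python rather than MATLAB and is not the toolbox's actual code:

```python
import numpy as np

def rmsd(P, Q):
    """Optimal-superposition RMSD of two Nx3 coordinate sets (Kabsch/SVD)."""
    P = P - P.mean(axis=0)                       # center both point sets
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)            # SVD of the covariance
    d = np.sign(np.linalg.det(U @ Vt))           # avoid improper rotation
    R = U @ np.diag([1, 1, d]) @ Vt
    return np.sqrt(((P @ R - Q) ** 2).sum() / len(P))

# A pure rotation of the same points superposes exactly, so RMSD ~ 0
P = np.random.default_rng(1).normal(size=(10, 3))
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
print(rmsd(P, P @ Rz.T))
```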
https://doi.org/10.1142/9789812702098_0030
We address the problem of stabbing a sequence of indexed balls in ℝ³, in increasing order of the indices, using a set of line or line-segment stabbers S = {s1, s2, ⋯, sm} such that each stabber is associated (starts, ends, or passes through) with the centers of the smallest- and largest-index balls that it stabs. The problem finds applications in the simplification of molecule chains for visualization, modeling, manipulation, matching, and efficient searching in molecule and protein databases.
https://doi.org/10.1142/9789812702098_0031
Calculated folding free energies of mRNA sequences from 33 human genes were each compared to ten step-wise partially shuffled versions of the same sequences, with the dinucleotide, or "neighbor", composition kept preserved. The first ten base-swap steps of the shuffling process were dissected by examining the folding free energies after each base swap. This whole procedure was repeated a hundred times for each gene. Most of the genes showed a tendency for the average free energy to decline dramatically, suggesting that these natural mRNAs have been subject to selection to maximize the folding free energy. For most of the genes, the direction of average change over these first ten steps was accurately predicted by the mRNA folding Z-scores calculated for thoroughly shuffled sequences. Estimates of how many swap steps would be required to match the average free energy of fully shuffled sequences ranged from 12 to 69.2.
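One simple way to realize a dinucleotide-preserving base swap (not necessarily the authors' exact procedure) is to exchange two non-adjacent bases that share identical left and right neighbors, which leaves all dinucleotide counts unchanged:

```python
import random

def neighbor_preserving_swap(seq, rng=random.Random(0), max_tries=1000):
    """One base-swap step preserving dinucleotide ("neighbor") counts.

    Two non-adjacent interior positions may exchange bases when they share
    the same left and right neighbors; a simple realization of the swap
    primitive described in the abstract.
    """
    s = list(seq)
    for _ in range(max_tries):
        i, j = rng.sample(range(1, len(s) - 1), 2)
        if (abs(i - j) >= 2 and s[i] != s[j]
                and s[i - 1] == s[j - 1] and s[i + 1] == s[j + 1]):
            s[i], s[j] = s[j], s[i]
            return "".join(s)
    return seq  # no valid swap found within the budget

before = "AUGGCAUGGCUAAUGCCAUGG"
print(before, "->", neighbor_preserving_swap(before))
```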
https://doi.org/10.1142/9789812702098_0032
In this paper, we consider PLAINS, an algorithm that provides efficient alignment of DNA sequences using piecewise-linear gap penalties that closely approximate more general and meaningful gap functions. The innovations of PLAINS are fourfold. First, when the number of parts of a piecewise-linear gap function is fixed, PLAINS uses linear space in the worst case and obtains an alignment that is provably correct under its memory constraints, and thus has an asymptotic complexity similar to the best current implementations of Smith-Waterman. Second, we score alignments in PLAINS based on important segment pairs and optimize gap parameters based on interspecies alignments, thus identifying more significant correlations than other similar algorithms. Third, we describe a practical implementation of PLAINS in the Valis multi-scripting environment with powerful and intuitive visualization interfaces, which allows users to view alignments with a natural multiple-scale color grid scheme. Fourth, and most importantly, we have evaluated the biological utility of PLAINS using extensive lab results; we report the result of comparing a human sequence to a fugu sequence, where PLAINS was capable of finding more orthologous exon correlations than similar alignment tools.
https://doi.org/10.1142/9789812702098_0033
Protein secondary structure prediction (SSP) is often used as an initial step towards tertiary structure prediction and functional analysis. The accuracy of SSP is therefore important for further structural and functional analysis. Although the overall prediction accuracy of current SSP methods is about 76%, the accuracy varies from protein to protein: some proteins are predicted more accurately than others. In this paper, we studied the possibility of identifying relatively poor SSP for helices when additional information, the length distribution of helices in a protein, is known. To simplify the evaluation of prediction accuracy, we propose a single parameter, Qunion, to evaluate the prediction accuracy for helices. Qunion combines two commonly used parameters, Qobs and Qpred, into one, and is not aimed at distinguishing between over-prediction and under-prediction; it can be used in applications where over-prediction and under-prediction are equally considered inaccurate. We also propose a simple parameter, Ldiff, to roughly evaluate the accuracy of SSP for helices when the length distribution of helices is known for the protein. Our 49-protein test suggests that |Ldiff| is roughly inversely related to Qunion, the parameter reflecting the prediction accuracy for helices.
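The abstract does not give the formula; one natural reading that combines Qobs and Qpred into a single measure penalizing over- and under-prediction symmetrically (an assumption, not necessarily the paper's definition) is an intersection-over-union score:

```latex
% One plausible definition (assumption): helix residues both predicted and
% observed, over those predicted or observed.
Q_{\mathrm{obs}} = \frac{TP}{TP + FN}, \qquad
Q_{\mathrm{pred}} = \frac{TP}{TP + FP}, \qquad
Q_{\mathrm{union}} = \frac{TP}{TP + FP + FN}
```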
https://doi.org/10.1142/9789812702098_0034
A new method has been developed to find similar substructures in protein binding cavities. It is based on the idea that protein function is intimately related to the recognition and subsequent response to the binding of an endogenous ligand in a well-characterized binding pocket. It can therefore be assumed that proteins having similar binding cavities also bind similar ligands and exhibit related function. For the comparison of the binding cavities, the binding-site exposed physicochemical characteristics are described by assigning generic pseudocenters to the functional groups of the amino acids flanking a particular active site. These pseudocenters are assembled into small substructures. To find substructures with spatial similarity and appropriate chemical properties, an emergent self-organizing map is used for clustering. Two substructures which are found to be similar form the basis for an expanded comparison of the complete cavities. Preliminary results with four pairs of binding cavities show that similarities are detected correctly and motivate further studies.
https://doi.org/10.1142/9789812702098_0035
Parallel Tempering (PT) is an effective algorithm for overcoming the slow convergence of low-temperature protein simulation: multiple systems run at multiple temperature levels and randomly exchange with neighboring temperature levels. We implemented the PT scheme in Rosetta to explore the rough energy landscape of protein folding and to improve Rosetta's success rate on topologically complex structures. Compared to the original Simulated Annealing (SA) scheme in Rosetta, our preliminary computational results show that the PT scheme exhibits wider-ranging sampling of the potential energy surface in protein folding.
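The standard replica-exchange acceptance rule underlying such a PT scheme (the general textbook criterion, not Rosetta's implementation) is sketched below:

```python
import math, random

def pt_swap(E_i, E_j, T_i, T_j, rng=random.Random(0)):
    """Metropolis criterion for exchanging replicas at temperatures T_i, T_j.

    Accept with probability min(1, exp((1/T_i - 1/T_j) * (E_i - E_j))),
    the standard parallel-tempering exchange rule.
    """
    delta = (1.0 / T_i - 1.0 / T_j) * (E_i - E_j)
    return delta >= 0 or rng.random() < math.exp(delta)

# A cold replica stuck at high energy gladly swaps with a hotter replica
# that has found a lower-energy state (energies/temperatures invented).
print(pt_swap(E_i=-120.0, E_j=-150.0, T_i=1.0, T_j=2.0))
```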
https://doi.org/10.1142/9789812702098_0036
This paper presents a novel approach to epitope prediction based on the clustering of known T-cell epitopes for a given MHC class I allele (HLA-A*0201). A combination of association rules (ARs) and sequence-structure patterns (SSPs) was used to cluster training set epitopes from the Antijen database. A regression model was then built from each cluster, and a peptide from the test set was declared to be an epitope only if one or more of the models gave a positive prediction. The sensitivity (TP/(TP+FN)) of the AR/SSP regression models approach was higher than that of a single regression model built on the entire training set, and was also higher than the sensitivity measures for SYFPEITHI, Rankpep, and ProPred1 on the same test set.
https://doi.org/10.1142/9789812702098_0037
De novo peptide sequencing, which determines the amino acid sequence of a peptide via tandem mass spectrometry (MS/MS), is increasingly used in proteomics for protein identification. Current de novo methods generally employ graph-theoretic approaches, which usually produce a large number of candidate sequences and incur heavy computational cost while trying to determine a sequence with less ambiguity. In this paper, we present an efficient and effective de novo sequencing algorithm which greatly reduces the number of candidate sequences. By utilizing certain properties of the b- and y-ion series in an MS/MS spectrum, we first propose a reliable two-way parallel searching algorithm to filter peptide candidates, which are further pruned by an intensity-evidence-based screening criterion. Then, the best candidate is singled out using a scoring function that considers the total intensity evidence within a certain local region. Results of our algorithm were compared with those of PEAKS, a well-known de novo sequencing software package. Experimental results demonstrate the efficiency and potency of our approach.
https://doi.org/10.1142/9789812702098_0038
Tandem mass spectrometry has been recognized as a useful technique for the characterization of glycan structures. However, current software for the automated interpretation of tandem mass spectra of glycans is limited by the size, biosynthetic rules, and fragmentation type of glycans. We developed a robust, fully automated computer software program, GlycoMaster, to determine glycan structures from mass spectra, using a heuristic dynamic algorithm. Experiments showed that GlycoMaster can very accurately elucidate glycan structures from tandem mass spectra without these limitations. In this paper, we describe the algorithm.
https://doi.org/10.1142/9789812702098_0039
We consider the problem of explicitly enumerating and counting the assembly pathways by which an icosahedral viral shell forms from identical constituent protein monomers. This poorly understood assembly process is a remarkable example of symmetric macromolecular self-assembly occurring in nature and possesses many features that are desirable when engineering self-assembly at the nanoscale.
We use the new model of [24,25] that employs a static geometric constraint graph to represent the driving (weak) forces that cause a viral shell to assemble and hold it together. The model was developed to answer focused questions about the structural properties of the most probable types of successful assembly pathways. Specifically, the model reduces the study of pathway types and their probabilities to the study of the orbits of the automorphism group of the underlying geometric constraint graph, acting on the set of pathways.
Since these are highly symmetric polyhedral graphs, it seems a viable approach to explicitly enumerate these orbits and count their sizes. The contribution of this paper is to isolate and simplify the core combinatorial questions, list related work and indicate the advantages of an explicit enumerative approach.
https://doi.org/10.1142/9789812702098_0040
With the rapidly increasing number of available whole genome sequences and advances in technology, it is practical and desirable to derive valuable information regarding genetic inheritance and polymorphic relationships among organisms by aligning multiple related genomes. Several applications for multiple whole genome alignment have been produced recently. However, these applications assume collinearity among the genomes without considering possible genome rearrangement events, so they cannot detect rearrangement events during alignment, and their results may be incomplete or even contain misleading information from a biological perspective. In this paper we describe a new method, EMAGEN-R, which can align multiple genomes and identify rearrangement events during alignment. The experimental results show that EMAGEN-R can successfully detect rearrangement events and generate improved alignments: coverage of the alignments is increased, and conservation among the multiple genomes is more accurately disclosed.
https://doi.org/10.1142/9789812702098_0041
Developing a reliable gene finding system is a very important task in extracting valuable information from non-annotated genomic DNA sequence. The success of such a system depends heavily on the accuracy of splice site predictions. Despite extensive research on gene finding, current splice site prediction systems do not yield reliable accuracy: due to the extremely high imbalance ratio of splice sites to non-splice sites, they tend to have a high false positive rate and/or low overall accuracy. In this paper, we propose several new features that are more indicative of the presence of splice sites. More importantly, we also present a simple but effective filtering approach for dealing with the imbalance problem. We show that our approach, coupled with the new features, significantly outperforms three existing popular techniques on human genome sequences in terms of overall accuracy as well as false positive rate.
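The paper's filtering rules are not given in the abstract; as a generic stand-in for imbalance handling of this kind, a crude subsampling filter could look like:

```python
import numpy as np

def filter_negatives(X, y, ratio=5, rng=np.random.default_rng(0)):
    """Crude imbalance handling: keep all splice-site positives, subsample
    non-sites to `ratio` negatives per positive.  A generic stand-in, not
    the paper's actual filtering approach.
    """
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=min(len(neg), ratio * len(pos)),
                          replace=False)
    keep = np.concatenate([pos, keep_neg])
    return X[keep], y[keep]
```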
https://doi.org/10.1142/9789812702098_0042
Protein sequences and their structures can largely be described as combinations of conserved protein domains. Although only a very small number of protein structures (<20,000) have been determined experimentally, more than half of the known protein domains can already be found in the structural database. This provides a rich source of three-dimensional templates for modeling at least half of all conserved protein domains. Here, we searched the entire Protein Data Bank (PDB) for all InterPro (protein domain database) entries. Similar protein domains with structural information were clustered, thus partitioning the PDB. In each of the resulting domain clusters, a multiple structural alignment was constructed based only on the 3D positions of all residues of the domains involved. The overall goal of this study is to use these structure alignments as anchors to increase the accuracy of the alignment between a query and its 3D template, as required in homology-based structural modeling. Here we report 1) the construction of such a structural library for all known protein domains; 2) the use of structural alignment (instead of sequence alignment) to select and map optimal templates; and 3) our validation using known structures as benchmarks to assess the modeling outcome. Our preliminary results show this is a promising method for predicting the structures of a majority of known protein domains.
https://doi.org/10.1142/9789812702098_0043
A comparison is made among diverse six-dimensional Boolean representations of the genetic code. These models were incorporated into an interactive computational tool, HyperGCode, to graphically display the topological relations among codons, amino acids, and amino acid properties. After a description of the tool's capabilities, I demonstrate its usefulness with one example: a comparative analysis of mutations in β-lactamases showing that both natural mutants and those produced by directed evolution are related to one another by one- and two-bit changes in their codons. The Darwinian solutions to the β-lactamase design problem, in nature and in vitro, exhibit mainly conservative amino acid substitutions at functionally important positions.
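To make the bit-change metric concrete, here is one possible 6-bit codon encoding (two bits per base; this particular bit assignment is an assumption for illustration, not necessarily HyperGCode's encoding) with a Hamming-distance helper:

```python
# One possible 6-D Boolean codon representation: two bits per base, so each
# codon sits at a vertex of the 6-cube.
BITS = {"A": (0, 0), "G": (0, 1), "C": (1, 0), "U": (1, 1)}

def encode(codon):
    """Map a 3-letter RNA codon to its 6-bit vertex."""
    return tuple(bit for base in codon for bit in BITS[base])

def hamming(c1, c2):
    """Number of bit flips separating two codons in the 6-cube."""
    return sum(a != b for a, b in zip(encode(c1), encode(c2)))

# Under this assignment, the transition A<->G flips one bit and the
# transversion A<->U flips two.
print(hamming("GAA", "GGA"))   # 1
print(hamming("GAA", "GUA"))   # 2
```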
https://doi.org/10.1142/9789812702098_0044
In order to gain new insight into the nature of the canonical set of twenty amino acids, we attempted to build a spatial structure arranging the set of amino acids, as had been done for the duplets and triplets of the genetic code. Analysis of the properties of a set of 12 meridian cycles obtained on the structure of the duplet genetic code, which is isomorphic to the Boolean hypercube B4, revealed four groups of cycles united in pairs by anti-symmetry transformations of two types. These transformations are most clearly illustrated on the icosahedron, a polyhedron with 12 vertices. Related to the icosahedron is another polyhedron, the dodecahedron, which has 20 vertices. An approach based on these two polyhedrons was applied to the analysis of the structure of the canonical set of 20 amino acids. It was demonstrated that four groups of amino acids, each containing five amino acids connected by anti-symmetry transformations of two types, can be distinguished in the initial set. The revealed principles are pictorially represented on the structure of the dodecahedron.
https://doi.org/10.1142/9789812702098_0045
The replication of molecular texts is the most fundamental reaction in any conceivable chemical biology. Considered in abstracto, the process involves two essential components: an n-letter alphabet serving to express information, and an associated set of permutations which, applied repeatedly or in suitable combinations, reproduce an original text. Polymerase, acting on A, C, G and U/T, is nature's permutation of choice, but other alphabets and permutations are available. This paper considers the role of symmetry and informatics in limiting both the composition of viable alphabets and the permutations that might reasonably find molecular expression.
https://doi.org/10.1142/9789812702098_0046
A particular arrangement of the 64-codon system allows us to extract an exact Euler relation for Yang's 28-gon polyhedral model of the genetic code, as well as the correct degeneracy pattern of the latter, using the nucleons, or the atoms carbon, nitrogen, oxygen and sulfur, as digital signs. This "sui generis" numerical information is preserved by our dihedral D8 (pivotal) symmetry.
https://doi.org/10.1142/9789812702098_0047
Systems of elements of the genetic code are studied via their cognitive presentation in the form of symbolic and numerical mathematical matrices. This cognitive form of data presentation makes it possible to discover new phenomenological rules of the evolution of genetic codes, to reveal two branches of evolution within the genetic code, to expose hidden interrelations between the golden section and parameters of genetic polyplets, to disclose matrices of hyperbolic turn within genetic matrices, etc. The mysterious sets of structures realized by nature in the hierarchical system of genetic codes can be heuristically compared with families of mathematical matrices that contain elements of these structures. A few rules of degeneracy and segregation of genetic codes are revealed in this direction. A new answer to the fundamental question "why 20 amino acids?" is proposed as well.
https://doi.org/10.1142/9789812702098_0048
A large body of literature relates the physical and chemical properties of the amino acids to protein folding. However, relatively little is known about the relationship between codons and secondary protein structure. Protein fold prediction based on nucleotide strings is important because the Genome Project has produced a large number of gene sequences that code for proteins of often unknown structure and function. We investigated how the symbolic encoding of nucleotides and amino acids influences protein structure. Genetic code randomization analysis with respect to distinguishing between all-α and all-β protein fold classes indicated that: (a) there is a very low chance that a code better than the one specified by nature is randomly produced; and (b) the basic protein units with respect to the genetic coding of α- and β-protein fold classes are not monomers (single amino acids) but dipeptides and tripeptides.
https://doi.org/10.1142/9789812702098_0049
The goal of this research was to investigate the relationships among the distances measured between the different amino acids in 7,964 tripeptides from the Protein Data Bank (PDB). As a first step, we developed a program capable of calculating the types of two triangles based on the twelve distances measured between the different amino acids in the protein. The second objective was to use an unsupervised neural network to cluster tripeptides based on the same input data. The Self-Organizing Map network selected for this purpose was successful in categorizing the data in close approximation to the results achieved by the program.
https://doi.org/10.1142/9789812702098_0050
Our multidisciplinary effort has allowed a novel identification of the unprecedented quasi-28-gonal D2h symmetric feature and a "primordial codon core" of the genetic code, which are inherently associated with two hypothetical evolutionary axes, evidently from simple to complex, and the Fibonacci-Lucas sequence-restricted non-random selection of the side chain carbon atoms in the 20 amino acids.
https://doi.org/10.1142/9789812702098_0051
The purpose of this study was to correlate the Revised Movement Imagery Questionnaire (MIQ-R) with lower and upper alpha power to determine the validity of alpha activity as a measure of imagery ability. EEG activity was recorded from 30 right-handed volunteers. Results revealed that lower alpha activity was significantly correlated (p<.01) with MIQ-R kinesthetic subscale at both the right and left parietal sites and lower alpha activity was significantly correlated (p<.05) with MIQ-R visual subscale at the right and left occipital and parietal regions. The results support previous research showing that both left and right hemispheres are activated during imagery processing. These findings corroborate the idea that a verbal component is involved in imagery rehearsal.
https://doi.org/10.1142/9789812702098_0052
Bioinformatics improves research in biology by employing information technology to develop effective data analysis tools. To aid the analysis of multiple organism ontologies, a visualisation prototype was developed that provides intuitive identification of relationships within the data. Limitations of analysis in 2D led to an extension into the third dimension, with improved analysis that helps to manage the complexity inherent in 3D.
https://doi.org/10.1142/9789812702098_0053
This work focuses on noninvasive functional mapping of the human brain by integrating three-dimensional (3-D) optical topographic maps with anatomical Magnetic Resonance Imaging (MRI). MRI provides information about the anatomical structure of the brain, while the Optical Topographic System (OTS) displays changes in cerebral blood flow in the form of a topographic image. The fusion of the two modalities offers excellent prospects for localizing the activation seen by the OTS on the 3-D head model constructed from MRI slices. The proposed OTS-MRI integration allows the investigation of a wide range of motor, sensory, and cognitive functions, while also providing vivid information about the anatomical structure of the brain as revealed through the different modalities. This task demands the development of a new algorithm that accurately registers the 3-D head model to the topographic image. The superimposition of topographic images on structural MRI allows doctors to identify which regions of the brain correspond to the activation shown in the topographic image. In this way, a comparison can be made between the two functional modalities as a means to validate the results.
https://doi.org/10.1142/9789812702098_0054
Biological databases have become increasingly important distributed resources in this field. These databases contain a mix of data types, including sequence data (DNA, RNA and protein sequences), structured data such as molecular weights or GC content, annotations in terms of controlled vocabularies and, increasingly, ontologies such as the Gene Ontology, as well as free-text data in comment fields. The resulting genomic information, however, exhibits characteristics that make it very difficult to interpret and exploit: it comes in huge volumes, it is heterogeneous, and it is geographically distributed. This paper proposes a combination of two systems, GenoMEDIA and Genome3DExplorer, addressing the problem of integrating and visualizing these heterogeneous databases and software utilities within a generic distributed platform. GenoMEDIA provides two specific middleware components, named Lydia and Antje. Lydia offers facilities for integration of, and collaboration between, services and databases, and Antje allows users to manage multiple heterogeneous remote databases in a uniform view. Genome3DExplorer is a user-friendly visualizer of this information within an immersive environment. The visualization is based on a well-adapted graphical paradigm suitable for building a graph-based representation. Genome3DExplorer allows biologists to easily explore huge sets of genomic data and their relationships.
https://doi.org/10.1142/9789812702098_bmatter
AUTHOR INDEX.