Oligo kernels for biological sequence classification have a high discriminative power. A new parameterization for the K-mer oligo kernel is presented in which all oligomers of length K are weighted individually. The task-specific choice of these parameters increases the classification performance and reveals information about discriminative features. To adapt the multiple kernel parameters based on cross-validation, the covariance matrix adaptation evolution strategy (CMA-ES) is proposed. It is applied to optimize trimer oligo kernels for the detection of bacterial gene starts. The resulting kernels lead to higher classification rates, and the adapted parameters reveal the importance of particular triplets for classification, for example of those occurring in the Shine-Dalgarno sequence.
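As an illustration of the parameterization, the following Python sketch computes a position-sensitive trimer kernel in which every k-mer carries its own weight; the Gaussian smoothing width sigma, the function names and the uniform initial weights are assumptions made for this example, not the authors' implementation, and in the paper the per-trimer weights would be the parameters adapted by CMA-ES:

    import math
    from collections import defaultdict
    from itertools import product

    def kmer_positions(seq, k=3):
        """Map each k-mer to the list of positions where it occurs in seq."""
        pos = defaultdict(list)
        for i in range(len(seq) - k + 1):
            pos[seq[i:i + k]].append(i)
        return pos

    def weighted_oligo_kernel(s, t, weights, k=3, sigma=1.0):
        """Position-sensitive oligo kernel with one adjustable weight per k-mer.

        Occurrence positions of a shared k-mer in the two sequences are compared
        through a Gaussian of width sigma and summed with that k-mer's weight.
        """
        ps, pt = kmer_positions(s, k), kmer_positions(t, k)
        total = 0.0
        for w in set(ps) & set(pt):
            for p in ps[w]:
                for q in pt[w]:
                    total += weights.get(w, 1.0) * math.exp(-(p - q) ** 2 / (4 * sigma ** 2))
        return total

    # Uniform initial weights for all 64 trimers; an optimizer such as CMA-ES
    # would adapt these values based on cross-validation error.
    trimers = ["".join(c) for c in product("ACGT", repeat=3)]
    w = {m: 1.0 for m in trimers}
    print(weighted_oligo_kernel("ATGAGGAGGT", "CCATGAGGAG", w))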
This study borrowed a technique from molecular sequence analysis and applied it to genre identification, the process of determining the type or family of a given document: for example, is the document a letter, a news story, a horoscope, a joke, an advertisement, or a pornographic story? Genre identification allows a computer user to filter email and web sites further, in a way that is entirely different from topic-based methods. This study presents original research on an application of machine learning to the genre identification problem. The specific method selected for the application was neural modeling.
The data for the study came from a database constructed by the author and his colleagues. The data consisted of descriptive features and the genre classification, as judged by a human, for over 5,000 different documents. Ten different genres were represented. The descriptive features consisted of 89 different measurements of each document, such as average word length, the number of numeric terms, and the proportion of present-tense verbs. The data was divided into two sets, with 75% set aside for training and 25% reserved for testing.
The first neural network applied was a very basic single-layer network that achieved 79% correct classifications on the testing data. This performance was equivalent to that of the previous best method on the problem, decision trees. When more complex neural networks were applied to the problem, performance increased significantly. The best performance of 86% correct classifications was achieved by a network with a single hidden layer of 300 units. Increasing the number of hidden layers or changing the number of hidden units did not improve performance. The best score is also a significant improvement over scores obtained from topic-based filters.
The neural networks were further used to determine which input features were most influential in the networks' classifications. The average magnitude of the weights coming from each feature was computed after training. The analysis indicated that 10% of the features were of no use to the networks. An additional 10% were very influential and were responsible for most of the networks' performance. The remaining 80% ranged from marginally useful to useful.
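A minimal sketch of this setup and of the weight-magnitude analysis, using scikit-learn and synthetic stand-in data (the real 89 document features, 10 genres and roughly 5,000 labeled documents are not reproduced here), might look as follows:

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the 89 document features and 10 genre labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 89))
    y = rng.integers(0, 10, size=5000)

    # 75/25 split and a single hidden layer of 300 units, as in the study.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    net = MLPClassifier(hidden_layer_sizes=(300,), max_iter=200, random_state=0)
    net.fit(X_tr, y_tr)
    print("test accuracy:", net.score(X_te, y_te))

    # Influence of each input feature: average magnitude of its outgoing weights.
    influence = np.abs(net.coefs_[0]).mean(axis=1)
    ranking = np.argsort(influence)[::-1]
    print("most influential feature indices:", ranking[:10])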
The analysis of the features indicated that second-order information was being exploited by the networks for better performance. This means that, on this problem, neural networks will outperform statistical models and other methods that utilize only first-order information.
Let P, Q ⊆ ℝ² be two n-point multisets and Ar ≥ b be a set of λ inequalities on x and y, where A ∈ ℝ^(λ×2), r = (x, y)ᵀ, and b ∈ ℝ^λ. Define the constrained Minkowski sum (P ⊕ Q)_{Ar≥b} as the multiset {(p + q) | p ∈ P, q ∈ Q, A(p + q) ≥ b}. Given P, Q, Ar ≥ b, an objective function f : ℝ² → ℝ, and a positive integer k, the MINKOWSKI SUM SELECTION problem is to find the kth largest objective value among all objective values of points in (P ⊕ Q)_{Ar≥b}. Given P, Q, Ar ≥ b, an objective function f : ℝ² → ℝ, and a real number δ, the MINKOWSKI SUM FINDING problem is to find a point (x*, y*) in (P ⊕ Q)_{Ar≥b} such that |f(x*, y*) − δ| is minimized. For the MINKOWSKI SUM SELECTION problem with linear objective functions, we obtain the following results: (1) optimal O(n log n)-time algorithms for λ = 1; (2) O(n log² n)-time deterministic algorithms and expected O(n log n)-time randomized algorithms for any fixed λ > 1. For the MINKOWSKI SUM FINDING problem with linear objective functions or objective functions of the form f(x, y) = by/ax, we construct optimal O(n log n)-time algorithms for any fixed λ ≥ 1. As a byproduct, we obtain improved algorithms for the LENGTH-CONSTRAINED SUM SELECTION problem and the DENSITY FINDING problem.
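For intuition, a naive quadratic-time reference sketch of both problems can be written by enumerating the constrained Minkowski sum directly; this is exactly the work the paper's subquadratic algorithms avoid, and the points, constraint and objective below are made-up examples:

    from heapq import nlargest

    def constrained_minkowski_sum(P, Q, A, b):
        """All sums p + q with A (p + q) >= b, kept as a multiset (list)."""
        out = []
        for (px, py) in P:
            for (qx, qy) in Q:
                x, y = px + qx, py + qy
                if all(a1 * x + a2 * y >= bi for (a1, a2), bi in zip(A, b)):
                    out.append((x, y))
        return out

    def minkowski_sum_selection(P, Q, A, b, f, k):
        """k-th largest objective value over the constrained Minkowski sum."""
        vals = [f(x, y) for (x, y) in constrained_minkowski_sum(P, Q, A, b)]
        return nlargest(k, vals)[-1] if len(vals) >= k else None

    def minkowski_sum_finding(P, Q, A, b, f, delta):
        """Point whose objective value is closest to delta."""
        pts = constrained_minkowski_sum(P, Q, A, b)
        return min(pts, key=lambda p: abs(f(*p) - delta)) if pts else None

    # Example: one constraint x + y >= 3 and the linear objective f(x, y) = 2x + y.
    P = [(0, 1), (1, 2), (2, 0)]
    Q = [(1, 1), (0, 3)]
    A, b = [(1.0, 1.0)], [3.0]
    f = lambda x, y: 2 * x + y
    print(minkowski_sum_selection(P, Q, A, b, f, k=2))
    print(minkowski_sum_finding(P, Q, A, b, f, delta=4.5))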
Evolution of influenza viruses is a highly complex process that is still poorly understood. Multiyear persistence of similar variants and accumulating evidence for the existence of multigenic traits indicate that influenza viruses operate as integrated units and not only as sets of distinct genes. However, there is still no consensus on whether this is the case, and to what extent. One of the main problems is the lack of a framework for analyzing and interpreting the large body of available high-dimensional genomic, clinical and epidemiological data. By reducing the dimensionality of the data, we intend to show whether, in addition to gene-centric selective pressure, the evolution of influenza RNA segments is also shaped by their mutual interactions. Therefore, we analyze how different complexity/entropy measures (Shannon entropy, topological entropy and Lempel–Ziv complexity) can be used to study the evolution of nucleotide segments of different influenza subtypes while reducing data dimensionality. We show that, at the nucleotide level, multiyear clusters of genome-wide entropy/complexity correlations emerged during the H1N1 pandemic in 2009. Our data are the first empirical results that indirectly support the suggestion that a component of influenza evolutionary dynamics involves correlation between RNA segments. Of all the complexity/entropy measures used, Shannon entropy shows the best correlation with epidemiological data.
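Two of the measures mentioned can be sketched in a few lines of Python; the k-mer-based Shannon entropy follows the standard definition, while the Lempel–Ziv complexity below uses a simple distinct-phrase parsing that is only one of several common variants, and the example sequence is arbitrary:

    import math
    from collections import Counter

    def shannon_entropy(seq, k=1):
        """Shannon entropy (bits) of the k-mer distribution of a nucleotide sequence."""
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        n = sum(counts.values())
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def lempel_ziv_complexity(seq):
        """Number of distinct phrases in a simple LZ-style parsing of the sequence."""
        phrases, i = set(), 0
        while i < len(seq):
            j = i + 1
            # extend the current phrase until it has not been seen before
            while seq[i:j] in phrases and j <= len(seq):
                j += 1
            phrases.add(seq[i:j])
            i = j
        return len(phrases)

    segment = "ATGCGCGATATATATCCGGAA"
    print(shannon_entropy(segment, k=1), lempel_ziv_complexity(segment))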
We describe an exhaustive and greedy algorithm for improving the accuracy of multiple sequence alignment. A simple progressive alignment approach is employed to provide initial alignments. The initial alignment is then iteratively optimized against an objective function. For any working alignment, the optimization involves three operations: insertions, deletions and shuffles of gaps. The optimization is exhaustive because the algorithm applies the above operations to all eligible positions of an alignment, and greedy because only the operation that yields the best improvement in the objective score is accepted. The algorithms have been implemented in the EGMA (Exhaustive and Greedy Multiple Alignment) package using the Java programming language, and have been evaluated using the BAliBASE benchmark alignment database. Although EGMA is not guaranteed to produce a globally optimal alignment, the tests indicate that EGMA consistently builds high-quality alignments compared with other commonly used iterative and non-iterative alignment programs. It is also useful for refining multiple alignments obtained by other methods.
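EGMA itself is implemented in Java and also applies gap insertions and deletions; purely as an illustration of the exhaustive-and-greedy idea, the Python sketch below refines an alignment with gap shuffles only, against a sum-of-pairs objective with made-up scoring parameters:

    def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
        """Sum-of-pairs objective score of a gapped alignment (equal-length rows)."""
        score = 0
        for col in zip(*alignment):
            for i in range(len(col)):
                for j in range(i + 1, len(col)):
                    a, b = col[i], col[j]
                    if a == "-" or b == "-":
                        score += gap if a != b else 0
                    else:
                        score += match if a == b else mismatch
        return score

    def shift_gap(row, src, dst):
        """Move the gap at position src of one row to position dst (a "shuffle")."""
        chars = list(row)
        chars.insert(dst, chars.pop(src))
        return "".join(chars)

    def greedy_refine(alignment, objective=sum_of_pairs):
        """Exhaustive, greedy refinement: try every legal single-gap shuffle in
        every row, accept only the best score-improving move, and repeat until
        no move improves the objective."""
        best = objective(alignment)
        improved = True
        while improved:
            improved = False
            best_move = None
            for r, row in enumerate(alignment):
                for src, c in enumerate(row):
                    if c != "-":
                        continue
                    for dst in range(len(row)):
                        if dst == src:
                            continue
                        candidate = alignment[:r] + [shift_gap(row, src, dst)] + alignment[r + 1:]
                        s = objective(candidate)
                        if s > best:
                            best, best_move = s, candidate
            if best_move is not None:
                alignment, improved = best_move, True
        return alignment, best

    aln = ["AC-GT", "ACG-T", "A-CGT"]
    print(greedy_refine(aln))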
Identifying regions of DNA with extreme statistical characteristics is an important aspect of the structural analysis of complete genomes. Linguistic methods, mainly based on estimating word frequencies, can be used for this purpose, as they allow the delineation of regions of low complexity. Low complexity may be caused by biased nucleotide composition, by tandem or dispersed repeats, by palindrome-hairpin structures, or by a combination of all these features. We developed software tools in which various numerical measures of text complexity are implemented, including combinatorial and linguistic ones. We also added a Hurst exponent estimate to the software to measure dependencies in DNA sequences. By applying these tools to various functional genomic regions, we demonstrate that the complexity of introns and regulatory regions is lower than that of coding regions, whilst the Hurst exponent is larger. Further analysis of promoter sequences revealed that the lower complexity of these regions is associated with long-range correlations caused by transcription factor binding sites.
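A rescaled-range (R/S) estimate of the Hurst exponent, one common way to quantify long-range dependence, can be sketched as follows; the purine/pyrimidine mapping of the DNA sequence to a numeric walk and the window sizes are illustrative choices rather than the exact procedure used by the software described:

    import numpy as np

    def hurst_rs(series, window_sizes=(8, 16, 32, 64, 128)):
        """Rescaled-range (R/S) estimate of the Hurst exponent of a 1-D series."""
        series = np.asarray(series, dtype=float)
        log_n, log_rs = [], []
        for n in window_sizes:
            rs = []
            for start in range(0, len(series) - n + 1, n):
                w = series[start:start + n]
                dev = np.cumsum(w - w.mean())
                r = dev.max() - dev.min()
                s = w.std()
                if s > 0:
                    rs.append(r / s)
            if rs:
                log_n.append(np.log(n))
                log_rs.append(np.log(np.mean(rs)))
        # slope of log(R/S) against log(n) approximates the Hurst exponent
        return np.polyfit(log_n, log_rs, 1)[0]

    # Map a DNA sequence to a +/-1 purine/pyrimidine walk before estimating H.
    seq = "ATGCGT" * 200
    walk = [1 if c in "AG" else -1 for c in seq]
    print(hurst_rs(walk))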
The crucial role played by the analysis of microbial diversity in biotechnology-based innovations has increased interest in microbial taxonomy research. Phylogenetic sequence analyses have contributed significantly to the advances in this field, also in view of the large amount of sequence data collected in recent years. Phylogenetic analyses can be performed on the basis of protein-encoding nucleotide sequences or of the encoded amino acid sequences: the two approaches have different characteristics, yet start from two alternative representations of the same information. This complementarity could be exploited in a multimodal phylogenetic scheme that integrates gene and protein information to produce a single final tree, an aspect that has been poorly addressed in the literature. In this paper, we propose to integrate the two phylogenetic analyses using basic schemes derived from multimodal fusion theory (or multiclassifier systems theory), a well-founded and rigorous field whose effectiveness has already been demonstrated in other pattern recognition contexts. The proposed approach can be applied to distance matrix–based phylogenetic techniques (such as neighbor joining), resulting in a simple and fast method. The proposed methodology has been tested in a real case involving sequences of several species of lactic acid bacteria. With this dataset, both the nucleotide sequence–based and the amino acid sequence–based phylogenetic analyses present drawbacks, which are overcome by the multimodal analysis.
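One of the simplest fusion schemes from multiclassifier theory, the weighted sum rule, can be sketched directly at the level of distance matrices; the mixing weight alpha and the toy matrices are assumptions for illustration, and the fused matrix would then be fed to a distance-based method such as neighbor joining:

    import numpy as np

    def normalize(d):
        """Scale a distance matrix to [0, 1] so the two modalities are comparable."""
        d = np.asarray(d, dtype=float)
        return d / d.max() if d.max() > 0 else d

    def fuse_distances(d_nucleotide, d_protein, alpha=0.5):
        """Sum-rule fusion of gene- and protein-based distance matrices.

        The fused matrix can be passed to any distance-based tree-building
        method, e.g. neighbor joining, to obtain a single multimodal tree.
        """
        return alpha * normalize(d_nucleotide) + (1 - alpha) * normalize(d_protein)

    # Toy example with three taxa.
    d_gene = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
    d_prot = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
    print(fuse_distances(d_gene, d_prot))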
Subsequent duplication events are responsible for the evolution of minisatellite maps. Alignment of two minisatellite maps should therefore take these duplication events into account, in addition to the well-known edit operations. All algorithms for computing an optimal alignment of two maps, including the one presented here, first deduce the costs of optimal duplication scenarios for all substrings of the given maps and then incorporate the pre-computed costs into the alignment recurrence. However, all previous algorithms addressing this problem depend on the number of distinct map units (the map alphabet) and do not fully exploit the repetitiveness of the map units. In this paper, we present an algorithm that remedies these shortcomings: our algorithm is alphabet-independent and is based on the run-length encoding scheme. It is the fastest in theory, and in practice as well, as shown by experimental results. Furthermore, our alignment model is more general than that of the previous algorithms and better captures the duplication mechanism. Using our algorithm, we derive quantitative evidence that there is a directional bias in the growth of minisatellites in the MSY1 dataset.
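The run-length encoding of a minisatellite map, on which the algorithm operates, simply collapses consecutive copies of the same unit into (unit, count) pairs, for example:

    from itertools import groupby

    def run_length_encode(map_units):
        """Collapse runs of identical map units into (unit, repeat count) pairs."""
        return [(u, sum(1 for _ in g)) for u, g in groupby(map_units)]

    # A toy map over three distinct unit types.
    print(run_length_encode("aaabbaaaacc"))   # [('a', 3), ('b', 2), ('a', 4), ('c', 2)]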
Computational tools are essential components of modern biological research. For example, BLAST searches can be used to identify related proteins based on sequence homology, and when a new genome is sequenced, prediction models can be used to annotate functional sites such as transcription start sites, translation initiation sites and polyadenylation sites, and to predict protein localization. Here we present Sirius Prediction Systems Builder (PSB), a new computational tool for sequence analysis, classification and searching. Sirius PSB has four main operations: (1) building a classifier, (2) deploying a classifier, (3) searching for proteins similar to query proteins, and (4) preliminary and post-prediction analysis. Sirius PSB supports all these operations via a simple and interactive graphical user interface. Besides being a convenient tool, Sirius PSB also introduces two novelties in sequence analysis. Firstly, a genetic algorithm is used to identify interesting features in the feature space. Secondly, instead of the conventional method of searching for similar proteins via sequence similarity, we introduce searching via feature similarity. To demonstrate the capabilities of Sirius PSB, we have built two prediction models: one for the recognition of Arabidopsis polyadenylation sites and another for the subcellular localization of proteins. Both systems are competitive with current state-of-the-art models, based on evaluation on public datasets. More notably, the time and effort required to build each model is greatly reduced with the assistance of Sirius PSB. Furthermore, we show that under certain conditions, when BLAST is unable to find related proteins, Sirius PSB can identify functionally related proteins based on their biophysical similarities. Sirius PSB and its related supplements are available at:
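The feature-similarity search can be pictured as a nearest-neighbour query in feature space; the feature values, protein names and the Euclidean distance below are illustrative stand-ins for whatever descriptors Sirius PSB actually computes:

    import numpy as np

    def feature_similarity_search(query, database, top=5):
        """Rank database proteins by distance of their feature vectors
        (e.g. biophysical descriptors) to the query, instead of by sequence similarity."""
        q = np.asarray(query, float)
        dists = [(name, float(np.linalg.norm(np.asarray(v, float) - q)))
                 for name, v in database.items()]
        return sorted(dists, key=lambda x: x[1])[:top]

    # Toy database of three proteins described by three features each.
    db = {"P1": [0.2, 0.5, 0.1], "P2": [0.9, 0.4, 0.3], "P3": [0.25, 0.45, 0.12]}
    print(feature_similarity_search([0.22, 0.48, 0.11], db, top=2))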
Many recent studies have shown that access of animal microRNAs (miRNAs) to their complementary sites in target mRNAs is determined by several sequence-specific determinants beyond the seed regions in the 5′ end of miRNAs. These factors have been related to the repressive power of miRNAs and used in some programs to predict the efficacy of miRNA complementary sites. However, these factors have not been systematically examined regarding their capacities for improving miRNA target prediction. We develop a new miRNA target prediction algorithm, called Hitsensor, by incorporating many sequence-specific features that determine complementarities between miRNAs and their targets, in addition to the canonical seed regions in the 5′ ends of miRNAs. We evaluate the performance of our algorithm on 720 known animal miRNA:target pairs in four species, Homo sapiens, Mus musculus, Drosophila melanogaster and Caenorhabditis elegans. Our experimental results show that Hitsensor outperforms five popular existing algorithms, indicating that our unique scheme for quantifying the determinants of complementary sites is effective in improving the performance of a miRNA target prediction algorithm. We also examine the effectiveness of miRNA-mediated repression for the predicted targets by using a published quantitative protein expression dataset of miR-223 knockout in mouse neutrophils. Hitsensor identifies more targets than the existing algorithms, and the predicted targets of Hitsensor show comparable protein level changes to those of the existing algorithms.
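The canonical seed criterion that such predictors start from can be sketched as follows; the miRNA and UTR sequences are illustrative rather than real, and Hitsensor additionally scores determinants that lie outside the seed region:

    def reverse_complement(rna):
        """Reverse complement of an RNA string."""
        return rna.translate(str.maketrans("AUGC", "UACG"))[::-1]

    def seed_match_sites(mirna, utr):
        """Start positions in a 3' UTR that pair with the miRNA seed (nucleotides 2-8)."""
        seed = mirna[1:8]                  # seed region at the miRNA 5' end
        site = reverse_complement(seed)    # the sequence a target site must contain
        return [i for i in range(len(utr) - len(site) + 1) if utr[i:i + len(site)] == site]

    mirna = "UGUCAGUUUGUCAAAUACCCCA"        # miR-223-like example
    utr = "CCAAACUGACAUUUGGGAACUGACAA"      # toy UTR with two seed-matched sites
    print(seed_match_sites(mirna, utr))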
The Shannon entropy is a common way of measuring conservation of sites in multiple sequence alignments, and it has been extended with the relative Shannon entropy to account for background frequencies. The von Neumann entropy is another extension of the Shannon entropy, adapted from quantum mechanics in order to account for amino acid similarities. However, no relative von Neumann entropy has yet been defined for sequence analysis.
We introduce a new definition of the von Neumann entropy for use in sequence analysis, which we found to perform better than the previous definition. We also introduce the relative von Neumann entropy and a way of parametrizing it so that the Shannon entropy, the relative Shannon entropy and the von Neumann entropy are obtained at special parameter values. We performed an exhaustive search of this parameter space and found better predictions of catalytic sites compared with any of the previously used entropies.
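To make the relationship concrete, the sketch below computes both entropies for a single alignment column; the density-matrix construction shown (residue frequencies coupled with a similarity matrix) is one common formulation rather than the authors' exact new definition, and with the identity similarity used here as a placeholder it reduces to the Shannon entropy:

    import numpy as np
    from collections import Counter

    AA = "ACDEFGHIKLMNPQRSTVWY"

    def shannon_entropy(column):
        """Shannon entropy of an alignment column (natural log)."""
        counts = Counter(column)
        p = np.array([counts[a] for a in AA if counts[a] > 0], dtype=float)
        p /= p.sum()
        return float(-(p * np.log(p)).sum())

    def von_neumann_entropy(column, similarity=None):
        """Von Neumann entropy: -sum(l * log l) over eigenvalues of a density matrix.

        The density matrix couples residue frequencies with a similarity matrix;
        the identity used here is a placeholder (giving back the Shannon entropy),
        whereas a substitution-based similarity lets similar residues share
        probability mass.
        """
        counts = Counter(column)
        p = np.array([counts.get(a, 0) for a in AA], dtype=float)
        p /= p.sum()
        S = np.eye(len(AA)) if similarity is None else np.asarray(similarity, float)
        sqrt_p = np.diag(np.sqrt(p))
        rho = sqrt_p @ S @ sqrt_p
        rho /= np.trace(rho)
        eig = np.linalg.eigvalsh(rho)
        eig = eig[eig > 1e-12]
        return float(-(eig * np.log(eig)).sum())

    col = "DDDEEDDNDD"
    print(shannon_entropy(col), von_neumann_entropy(col))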
Proteases have central roles in "life and death" processes due to their important ability to catalytically hydrolyze protein substrates, usually altering the function and/or activity of the target in the process. Knowledge of the substrate specificity of a protease should, in theory, dramatically improve the ability to predict target protein substrates. However, experimental identification and characterization of protease substrates is often difficult and time-consuming. Thus solving the "substrate identification" problem is fundamental to both understanding protease biology and the development of therapeutics that target specific protease-regulated pathways. In this context, bioinformatic prediction of protease substrates may provide useful and experimentally testable information about novel potential cleavage sites in candidate substrates. In this article, we provide an overview of recent advances in developing bioinformatic approaches for predicting protease substrate cleavage sites and identifying novel putative substrates. We discuss the advantages and drawbacks of the current methods and detail how more accurate models can be built by deriving multiple sequence and structural features of substrates. We also provide some suggestions about how future studies might further improve the accuracy of protease substrate specificity prediction.
The temperature in the Arctic region has been increasing in the recent past, accompanied by melting of its glaciers. We took a snapshot of the current microbial inhabitation of an Alaskan glacier (which can be considered one of the simplest possible ecosystems) by metagenomic sequencing of 16S rRNA recovered from ice/snow samples. Somewhat contrary to our expectations and earlier estimates, a rich and diverse microbial population of more than 2,500 species was revealed, including several species of Archaea identified for the first time in glaciers of the Northern Hemisphere. The most prominent bacterial groups found were Proteobacteria, Bacteroidetes, and Firmicutes. Firmicutes were not reported in large numbers in a previously studied Alpine glacier but were dominant in an Antarctic subglacial lake. Representatives of Cyanobacteria, Actinobacteria and Planctomycetes were among the most numerous, likely reflecting the dependence of the ecosystem on energy obtained through photosynthesis and close links with the microbial community of the soil. Principal component analysis (PCA) of nucleotide word frequencies revealed distinct sequence clusters for different taxonomic groups in the Alaskan glacier community and separate clusters for the glacial communities from other regions of the world. Comparative analysis of the community composition and bacterial diversity of the Byron glacier in Alaska with other environments showed a larger overlap with an Arctic soil than with a high Arctic lake, indicating patterns of community exchange and suggesting that these bacteria may play an important role in soil development during glacial retreat.
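The word-frequency PCA step can be sketched as follows, with short made-up reads standing in for the 16S rRNA sequences and dinucleotide words used for brevity:

    import numpy as np
    from itertools import product
    from sklearn.decomposition import PCA

    def word_frequencies(seq, k=4):
        """Normalized k-mer ("word") frequency vector of a nucleotide sequence."""
        words = ["".join(w) for w in product("ACGT", repeat=k)]
        index = {w: i for i, w in enumerate(words)}
        v = np.zeros(len(words))
        for i in range(len(seq) - k + 1):
            w = seq[i:i + k]
            if w in index:
                v[index[w]] += 1
        return v / max(v.sum(), 1)

    # Toy reads; in practice each row would be one sequence from the metagenome.
    reads = ["ATGCGTACGTTAGCATGCGT", "ATGCGTACGTAAGCATGCGA",
             "GGGCCCTTTAAAGGGCCCTT", "GGGCCCTTTAATGGGCCCTA"]
    X = np.array([word_frequencies(r, k=2) for r in reads])
    pcs = PCA(n_components=2).fit_transform(X)
    print(pcs)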
Identification of transcription factor binding sites or biological motifs is an important step in deciphering the mechanisms of gene regulation. It is a classic problem that has eluded a satisfactory and efficient solution. In this paper, we devise a three-phase algorithm to mine for biologically significant motifs. In the first phase, we generate all possible string motifs; this phase is followed by a filtering process in which we discard all motifs that do not meet the constraints. In the final phase, motifs are scored and ranked using a combination of stochastic techniques and p-values. We show that our method outperforms some very well-known motif discovery tools, e.g. MEME and Weeder, on well-established benchmark data suites. We also apply the algorithm to the non-coding regions of M. tuberculosis and report significant motifs of size 10 with excellent p-values in a fraction of the time that MEME and MoSDi required. In fact, among the 10 best motifs (p-value wise) in the non-coding regions of M. tuberculosis reported by MEME, MoSDi and our tool, five were discovered by our approach, including the third and fourth best ones, all in 1/17 and 1/6 of the time that MEME and MoSDi, respectively, took.
Presynaptic and postsynaptic neurotoxins are two types of neurotoxins from venomous animals and functionally important molecules in the neurosciences; however, their experimental characterization is difficult, time-consuming, and costly. Therefore, bioinformatics tools that can identify presynaptic and postsynaptic neurotoxins would be very useful for understanding their functions and mechanisms. In this study, we propose Pippin, a novel machine learning-based method that allows users to rapidly and accurately identify these two types of neurotoxins. Pippin was developed using the random forest (RF) algorithm and evaluated based on an up-to-date dataset. A variety of sequence and motif features were combined, and a two-step feature-selection algorithm was employed to characterize the optimal feature subset for presynaptic and postsynaptic neurotoxin prediction. Extensive benchmark tests illustrate that Pippin significantly improved predictive performance as compared with six other commonly used machine-learning algorithms, including the naïve Bayes classifier, Multinomial Naïve Bayes classifier (MNBC), AdaBoost, Bagging, K-nearest neighbors, and XGBoost. Additionally, we developed an online webserver for Pippin to facilitate public use. To the best of our knowledge, this is the first webserver for presynaptic and postsynaptic neurotoxin prediction.
Background: Phosphorylation of histidine residues plays crucial roles in signaling pathways and cell metabolism in prokaryotes such as bacteria. While evidence has emerged that protein histidine phosphorylation also occurs in more complex organisms, its role in mammalian cells has remained largely uncharted. Thus, it is highly desirable to develop computational tools that are able to identify histidine phosphorylation sites. Results: Here, we introduce PROSPECT, a tool that enables fast and accurate prediction of proteome-wide histidine phosphorylation substrates and sites. Our tool is based on a hybrid method that integrates the outputs of two convolutional neural network (CNN)-based classifiers and a random forest-based classifier. Three feature encodings, namely one-of-K coding, enhanced grouped amino acid content (EGAAC) and composition of k-spaced amino acid group pairs (CKSAAGP), were taken as the inputs to the three classifiers, respectively. Our results show that PROSPECT is able to accurately predict histidine phosphorylation sites from sequence information. The PROSPECT web server is user-friendly and publicly available at http://PROSPECT.erc.monash.edu/. Conclusions: PROSPECT is superior to other pHis predictors in both running speed and prediction accuracy, and we anticipate that the PROSPECT web server will become a popular tool for identifying pHis sites in bacteria.
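Of the three encodings, the one-of-K coding is the simplest to sketch; the window length and the example peptide below are arbitrary, and in PROSPECT the resulting matrix would be the input to one of the CNN classifiers:

    import numpy as np

    AA = "ACDEFGHIKLMNPQRSTVWY"

    def one_of_k(window):
        """One-of-K (one-hot) encoding of a peptide window.

        Each residue becomes a 20-dimensional indicator vector; unknown or
        padding characters are encoded as all-zero rows.
        """
        m = np.zeros((len(window), len(AA)))
        for i, aa in enumerate(window):
            j = AA.find(aa)
            if j >= 0:
                m[i, j] = 1.0
        return m

    # A window of 15 residues centered on a candidate histidine (position 8).
    print(one_of_k("MKLSAVQHTGRWEDN").shape)   # (15, 20)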
Sorting signals are crucial for the anchoring of proteins to the cell surface in archaea and bacteria. These proteins often feature distinct motifs at their C-terminus, cleaved by sortase or sortase-like enzymes. Gram-positive bacteria exhibit the LPXTGX consensus motif, cleaved by sortases, while Gram-negative bacteria employ exosortases recognizing motifs such as PEP. Archaea utilize exosortase homologs known as archaeosortases for signal anchoring. Traditionally, the identification of such C-terminal sorting signals has been performed with profile hidden Markov models (pHMMs). The Cell-Wall PREDiction (CW-PRED) method introduced for the first time a custom-made class HMM for proteins in Gram-positive bacteria that contain a cell wall sorting signal beginning with an LPXTG motif, followed by a hydrophobic domain and a tail of positively charged residues. Here we present a new and updated version of CW-PRED for predicting C-terminal sorting signals in Archaea, Gram-positive, and Gram-negative bacteria. We used a large training set and several model enhancements that improve motif identification in order to achieve better discrimination between C-terminal signals and other proteins. Cross-validation demonstrates CW-PRED's superiority in sensitivity and specificity compared to other methods. Application of the method to reference proteomes reveals a large number of potential surface proteins not previously identified. The method is available for academic use at http://195.251.108.230/apps.compgen.org/CW-PRED/ and as standalone software.
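A crude regular-expression screen for the three components of an LPXTG-type signal (motif, hydrophobic stretch, positively charged tail) illustrates what the class HMM models far more sensitively; the thresholds and demo sequence are arbitrary assumptions:

    import re

    def has_lpxtg_like_signal(protein, tail_len=40):
        """Rough screen for an LPXTG-type cell-wall sorting signal near the C-terminus.

        Checks for (1) an LPXTG motif, (2) a downstream hydrophobic stretch and
        (3) positively charged residues in the tail. This is only a heuristic
        illustration; CW-PRED itself uses a class HMM rather than regular expressions.
        """
        tail = protein[-tail_len:]
        m = re.search(r"LP.TG", tail)
        if not m:
            return False
        after = tail[m.end():]
        hydrophobic = re.search(r"[AVLIFMWC]{7,}", after) is not None
        charged_tail = len(re.findall(r"[KR]", after[-10:])) >= 2
        return hydrophobic and charged_tail

    demo = "M" + "A" * 260 + "LPETG" + "SVAVLIVLAGLGLVA" + "RRKRS"
    print(has_lpxtg_like_signal(demo))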
Artificial markets are an emerging form of agent-based simulation in which agents represent individual industries, firms, or consumers interacting under simulated market conditions. While artificial markets demonstrate considerable potential for advancing innovation research, the validity of the method depends on the ability of researchers to construct agents that faithfully capture the key behavior of targeted entities. To date, few such methods have been documented in the academic literature.
This article describes a novel method for combining qualitative innovation research (case studies, grounded theory, and sequence analysis) with software engineering techniques to synthesize simulation-ready theories of adoption behavior. A step-by-step example is provided from the transportation domain. The result was a theory of adoption behavior that is sufficiently precise and formal to be expressed in Unified Modeling Language (UML). The article concludes with a discussion of the limitations of the method and recommendations for future applications to the study of diffusion of innovation.
In recent years, the completion of human genome sequencing has revealed a wide range of new and challenging issues involving raw data analysis. In particular, the discovery of information implicitly encoded in biological sequences is assuming a prominent role in identifying genetic diseases and in deciphering biological mechanisms. This information is usually represented by patterns that occur frequently in the sequences. Based on biological observations, a specific class of patterns is becoming particularly interesting: frequent structured patterns. In this respect, it is biologically meaningful to look at both "exact" and "approximate" repetitions of patterns within the available sequences. This paper contributes to this setting by providing algorithms that allow the discovery of frequent structured patterns, in both "exact" and "approximate" form, present in a collection of input biological sequences.
The causes of risks are hard to identify if events occur only transiently and disappear before their observable effects occur. In the face of this hurdle, it is desirable to explain transient causal events with significant effects, for the safety of human life. In this paper, Kamishibai KeyGraph, a variation of KeyGraph developed to deal with sequential data, is presented as a tool to explain causality involving transient causes. The method is applied here to two data sets: (1) newspaper text on social events, and (2) data on earthquakes in Japan. The performance of this method is hard to evaluate quantitatively due to the nature of transient events, so we partially evaluate the qualitative scenarios interpreted subjectively from the visualized maps.