Please login to be able to save your searches and receive alerts for new content matching your search criteria.
Given two sequences S1, S2, and a constrained sequence C, a longest common subsequence of S1, S2 with restriction to C is called a constrained longest common subsequence of S1 and S2 with C. At the same time, an optimal alignment of S1, S2 with restriction to C is called a constrained pairwise sequence alignment of S1 and S2 with C. Previous algorithms have shown that the constrained longest common subsequence problem is a special case of the constrained pairwise sequence alignment problem, and that both of them can be solved in O(rnm) time, where r, n, and m represent the lengths of C, S1, and S2, respectively. In this paper, we extend the definition of constrained pairwise sequence alignment to a more flexible version, called weighted constrained pairwise sequence alignment, in which some constraints might be ignored. We first give an O(rnm)-time algorithm for solving the weighted constrained pairwise sequence alignment problem, then show that our extension can be adopted to solve some constraint-related problems that cannot be solved by previous algorithms for the constrained longest common subsequence problem or the constrained pairwise sequence alignment problem. Therefore, in contrast to previous results, our extension is a new and suitable model for sequence analysis.
RNA-Seq is a Next-Generation Sequencing (NGS) protocol for sequencing the messenger RNA in a cell and generates millions of short sequence fragments, reads, in a single run. These reads can be used to measure levels of gene expression and to identify novel splice variants of genes. One of the critical steps in an RNA-Seq experiment is mapping NGS reads to the reference genome. Because RNA-Seq reads can span over more than one exon in the genome, this task is challenging. In the last decade, tools for RNA-Seq alignment have emerged, but most of them run in two phases. First, the pipeline only maps reads that have a direct match in the reference, and the remaining are set aside as initially unmapped reads. Then, they use heuristics based approaches, clustering or even annotations, to decide where to align the later. This work presents an efficient computational solution for the problem of transcriptome alignment, named SpliceTAPyR. It identifies signals of splice junctions and relies on compressed full-text indexing methods and succinct data structures to efficiently align RNA-Seq reads in a single phase. This way it achieves the same or better accuracy than other tools while using considerably less memory and time to the most competitive tools.
This paper presents a parallel system capable of accelerating biological sequence alignment on the graphics processing unit (GPU) grid. The GPU grid in this paper is a desktop grid system that utilizes idle GPUs and CPUs in the office and home. Our parallel implementation employs a master-worker paradigm to accelerate an OpenGL-based algorithm that runs on a single GPU. We integrate this implementation into a screensaver-based grid system that detects idle resources on which the alignment code can run. We also show some experimental results comparing our implementation with three different implementations running on a single GPU, a single CPU, or multiple CPUs. As a result, we find that a single non-dedicated GPU can provide us almost the same throughput as two dedicated CPUs in our laboratory environment, where GPU-equipped machines are ordinarily used to develop GPU applications. In a dedicated environment, the GPU-accelerated code achieves five times higher throughput than the CPU-based code. Furthermore, a linear speedup of 30.7X is observed on a 32-node cluster of dedicated GPUs. We also implement a compute unified device architecture (CUDA) based algorithm to demonstrate further acceleration.
This paper introduces a new linear systolic algorithm [10] for the sequence alignment problem [18]. It is made up of min(n, m) processors and computes the edit distance and the sequence alignment of two sequences Target and Source in time min(n, m) + 2.max(n, m), where n and m denote the lengths of Target and Source respectively. Its characteristics make it faster and more efficient than the previous linear array algorithm for the alignment search.
In this paper, a biometric technique based on gesture recognition is proposed to improve the security of operations requiring authentication in mobile phones. Users are authenticated by making a gesture invented by them holding a mobile phone embedding an accelerometer on their hand. An analysis method based on sequence alignment is proposed and evaluated in different experiments. Firstly, a test of distinctiveness of gestures has been proposed obtaining an equal error rate (EER) of 4.98% with a database of 30 users and four repetitions. With the same database, a second experiment representing the unicity of accessing attempts has resulted in an EER value of 1.92%. Finally, a third experiment to evaluate the robustness of the technique has examined a database of 40 users with eight repetitions and real falsification attempts, performed by three impostors from the study of recordings of the carrying out of the original gestures, resulting in an EER of 2.5%.
The continuous image sequence recognition is more difficult than the single image recognition because the classification of continuous image sequences and the image edge recognition must be very accurate. Hence, a method based on sequence alignment for action segmentation and classification is proposed to reconstruct a template sequence by estimating the mean action of a class category, which calculates the distance between a single image and a template sequence by sparse coding in Dynamic Time Warping. The proposed method, the methods of Kulkarni et al. [Continuous action recognition based on sequence alignment, Int. J. Comput. Vis. pp. 1–26.] and Hoai et al. [Joint segmentation and classification of human actions in video, IEEE Conf. Computer Vision and Pattern Recognition, 2008, pp. 108–119.] are compared on the recognition accuracy of the continuous recognition and isolated recognition, which clearly shows that the proposed method outperforms the other methods. When applied to continuous gesture classification, it not only can recognize the gesture categories more quickly and accurately, but is more realistic in solving continuous action recognition problems in a video than the other existing methods.
Performance monitoring counters (PMCs) are of great value to monitor the status of processors and their further analysis and modeling. In this paper, we explore a novel problem called PMC integration, i.e., how to combine a group of PMCs which are collected asynchronously together. It is well known that, due to hardware constraints, the number of PMCs that can be measured concurrently is strictly limited. It means we cannot directly acquire all the phenomenon features that are related with the system performance. Clearly, this source raw data shortage is extremely frustrating to PMCs based analysis and modeling tasks, such as PMCs based power estimation. To deal with this problem, we introduce a neighboring interval power values based PMC data integration approach. Based on the activity similarity of easily collected power dissipation values, the proposed approach can automatically combine distinct categories of PMC data together and hence realize the recovery of intact raw PMC data. In addition, the significance and effectiveness of the proposed approach are experimentally verified on a common task, the PMCs based power consumption modeling.
Comparing protein structures in three dimensions is a computationally expensive process that makes a full scan of a protein against a library of known protein structures impractical. To reduce the cost, we can use an approximation of the three dimensional structure that allows protein comparison to be performed quickly to filter away dissimilar proteins. In this paper, we present a new algorithm, called SCALE, for protein structure comparison. In SCALE, a protein is represented as a sequence of secondary structure elements (SSEs) augmented with 3D structural properties such as the distances and angles between the SSEs. As such, the comparison between two proteins is reduced to a sequence alignment problem between their corresponding sequences of SSEs. The 3-D structural properties of the proteins contribute to the similarity score between the two sequences. We have implemented SCALE, and compared its performance against existing schemes. Our performance study shows that SCALE outperforms existing methods in terms of both efficiency and effectiveness (measured in terms of precision and recall). To avoid exhaustive search, an index based on the structural properties is also proposed. The index prunes away a considerable amount of dissimilar proteins given a query protein.
The paper illustrates some mathematical problems that are motivated by molecular biological applications and the techniques that are used to address them. The context is that of detecting evolutionary relationship on the basis of molecular sequence data and of measuring its strength. The Stein–Chen method is shown to play a central role in the theoretical analysis of many of the procedures used in practice.
The novel introduction of spaced seed idea in the filtration step of sequence comparison by Ma, Tromp and Li (2002) has greatly increased the sensitivity of homology search without compromising the speed of search. In this paper, we survey some recent works on spaced seeds with main emphasis on their probabilistic and computational aspects.
Source code plagiarism is easy to commit but difficult to catch. Many approaches have been proposed in the literature to automate its detection; however there is little consensus on what works best. In this paper, we propose two new measures for determining the accuracy of a given technique and describe an approach to convert code files into strings which can then be compared for similarity in order to detect plagiarism. We then compare several string comparison techniques, heavily utilised in the area of biological sequence alignment, and compare their performance on a large collection of student source code containing various types of plagiarism. Experimental results show that the compared techniques succeed in matching a plagiarised file to its original files upwards of 90% of the time. Finally, we propose a modification for these algorithms that drastically improves their runtimes with little or no effect on accuracy. Even though the ideas presented herein are applicable to most programming languages, we focus on a case study pertaining to an introductory-level Visual Basic programming course offered at our institution.
There is a pressing need to align the growing set of expressed sequence tags (ESTs) with the newly sequenced human genome. However, the problem is complicated by the exon/intron structure of eukaryotic genes, misread nucleotides in ESTs, and the millions of repetitive sequences in genomic sequences. To solve this problem, algorithms that use dynamic programming have been proposed. In reality, however, these algorithms require an enormous amount of processing time. In an effort to improve the computational efficiency of these classical DP algorithms, we developed software that fully utilizes lookup-tables to detect the start- and endpoints of an EST within a given DNA sequence efficiently, and subsequently promptly identify exons and introns. In addition, the locations of all splice sites must be calculated correctly with high sensitivity and accuracy, while retaining high computational efficiency. This goal is hard to accomplish in practice, due to misread nucleotides in ESTs and repetitive sequences in the genome. Nevertheless, we present two heuristics that effectively settle this issue. Experimental results confirm that our technique improves the overall computation time by orders of magnitude compared with common tools, such as SIM4 and BLAT, and simultaneously attains high sensitivity and accuracy against a clean dataset of documented genes.
We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem — a set of target alignments, an associated probability distribution, and a seed model — that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search producing better results than ordinary spaced seeds.
Amino acid substitution matrices play an essential role in protein sequence alignment, a fundamental task in bioinformatics. Most widely used matrices, such as PAM matrices derived from homologous sequences and BLOSUM matrices derived from aligned segments of PROSITE, did not integrate conformation information in their construction. There are a few structure-based matrices, which are derived from limited data of structure alignment. Using databases PDB_SELECT and DSSP, we create a database of sequence-conformation blocks which explicitly represent sequence-structure relationship. Members in a block are identical in conformation and are highly similar in sequence. From this block database, we derive a conformation-specific amino acid substitution matrix CBSM60. The matrix shows an improved performance in conformational segment search and homolog detection.
Statistical and learning techniques are becoming increasingly popular for different tasks in bioinformatics. Many of the most powerful statistical and learning techniques are applicable to points in a Euclidean space but not directly applicable to discrete sequences such as protein sequences. One way to apply these techniques to protein sequences is to embed the sequences into a Euclidean space and then apply these techniques to the embedded points. In this work we introduce a biologically motivated sequence embedding, the homology kernel, which takes into account intuitions from local alignment, sequence homology, and predicted secondary structure. This embedding allows us to directly apply learning techniques to protein sequences. We apply the homology kernel in several ways. We demonstrate how the homology kernel can be used for protein family classification and outperforms state-of-the-art methods for remote homology detection. We show that the homology kernel can be used for secondary structure prediction and is competitive with popular secondary structure prediction methods. Finally, we show how the homology kernel can be used to incorporate information from homologous sequences in local sequence alignment.
Metagenomic sequencing technologies are advancing rapidly and the size of output data from high-throughput genetic sequencing has increased substantially over the years. This brings us to a scenario where advanced computational optimizations are requested to perform a metagenomic analysis. In this paper, we describe a new parallel implementation of nucleotide BLAST (MPI-blastn) and a new tool for taxonomic attachment of Basic Local Alignment Search Tool (BLAST) results that supports the NCBI taxonomy (NCBI-TaxCollector). MPI-blastn obtained a high performance when compared to the mpiBLAST and ScalaBLAST. In our best case, MPI-blastn was able to run 408 times faster in 384 cores. Our evaluations demonstrated that NCBI-TaxCollector is able to perform taxonomic attachments 125 times faster and needs 120 times less RAM than the previous TaxCollector. Through our optimizations, a multiple sequence search that currently takes 37 hours can be performed in less than 6 min and a post processing with NCBI taxonomic data attachment, which takes 48 hours, now is able to run in 23 min.
Sequence alignment is a fundamental and important tool for sequence data analysis in molecular biology. Many applications in molecular biology require the detection of a similarity pattern displayed by a number of DNA and protein sequences. Visual front-ends are useful for an intuitive viewing of alignment and help to analyze the structure, functions, and evolution of the DNA and protein. In this paper, we designed and implemented an interactive system for data visualization in DNA and proteins, which can be used in determining a sequence alignment, similarity search of sequence data, and function interference. Experimental results show that a user can easily operate the system after one hour's practice on the proposed system, which provides a clean output, easy identification of similarity and visualization of alignment data.
Reconstructed biological networks are the essence of knowledge originating from experiments, scientific literature, databases and modeling. Proteins are the major players in biological networks. If the function of a protein is not yet known, it can often be deduced from homologous proteins that are already experimentally characterized. As such conclusions are not as reliable as experimental evidences, they should be well documented and reviewed when experimental data is available. Inconsistent operation of the resulting network may indicate invalid functional assignments. Here we present a novel technique to refer to annotated sequence and 3D-structure alignments in terms of Web links. By clicking the Web link the alignment is viewed in the protein viewer STRAP. References to public protein databases such as EMBL, KEGG, GENBANK, PDB, PFAM, PRODOM and UNIPROT/SWISSPROT are encoded in the Web-link whereas the alignment gaps are computed dynamically. Site specific annotations and 3D-rendering commands may also be included in the Web-link. In contrast, sequence features such as active site residues, phosphorylation sites and ligand binding sites do not need to be specified, as long as they are retrievable from public databases. The method has been developed for an information management system that is used for the reconstruction of metabolic pathways. The alignment viewer may also be of interest for experimentalists, as it can be used to document sites of interest in the proteins under experimental investigation. These alignment Web links may be included in project Web sites. Availability: The STRAP program is published under the GNU-license condition and is automatically downloaded from http://3d-alignment.eu/ or http://www.charite.de/bioinf/strap/ when an alignment reference is clicked.
Many theoretical advances have been made in applying probabilistic inference methods to improve the power of sequence homology searches, yet the BLAST suite of programs is still the workhorse for most of the field. The main reason for this is practical: BLAST's programs are about 100-fold faster than the fastest competing implementations of probabilistic inference methods. I describe recent work on the HMMER software suite for protein sequence analysis, which implements probabilistic inference using profile hidden Markov models. Our aim in HMMER3 is to achieve BLAST's speed while further improving the power of probabilistic inference based methods. HMMER3 implements a new probabilistic model of local sequence alignment and a new heuristic acceleration algorithm. Combined with efficient vector-parallel implementations on modern processors, these improvements synergize. HMMER3 uses more powerful log-odds likelihood scores (scores summed over alignment uncertainty, rather than scoring a single optimal alignment); it calculates accurate expectation values (E-values) for those scores without simulation using a generalization of Karlin/Altschul theory; it computes posterior distributions over the ensemble of possible alignments and returns posterior probabilities (confidences) in each aligned residue; and it does all this at an overall speed comparable to BLAST. The HMMER project aims to usher in a new generation of more powerful homology search tools based on probabilistic inference methods.
We evaluate the performance of common substitution matrices with respect to structural similarities. For this purpose, we apply an all-versus-all pairwise sequence alignment on the ASTRAL40 [7] dataset, consisting of 7290 entries with a pairwise sequence identity of at most 40%. Afterwards, we compare the 100 highest scoring sequence alignments to their corresponding structural alignments, which we obtain from our structure alignment database. Our database consists of about 18.6 million pairwise entries. We calculated these alignments by applying the current version of GANGSTA [1], our non-sequential structural alignment tool, on about 26 million pairs. The results illustrate the difficulty of homology based protein structure prediction in cases of low sequence similarity. Further, the large fraction of structurally similar proteins in the ASTRAL40 dataset is quantitatively measured. Thereby, this investigation yields a new perspective on the topic of sequence and structure relation. Hence, our finding is a large-scale quality measure for any sequence based method, which aims to detect structural similarities.