The local sequence alignment problem is the detection of similar subsequences in two given sequences of lengths n and m, where n ≥ m. Unfortunately, the common notion of local alignment suffers from well-known anomalies that result from not taking into account the lengths of the aligned subsequences. We introduce the length-restricted local alignment problem, which adds as a constraint an upper limit T on the length of one of the subsequences to be aligned. Obvious extensions of the dynamic programming formulation for this problem result in expensive computations: O(Tnm) time and O(Tm) space. We propose an efficient approximation algorithm that finds a solution satisfying the length bound whose score is within Δ of the optimum score, for any given positive integer Δ, in O(nmT/Δ) time and O(mT/Δ) space. We also introduce the cyclic local alignment problem and show how our idea can be applied to this case as well; this is a dual approach to the well-known cyclic edit distance problem. These results generalize directly to the case of affine gap penalties.
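For reference, the straightforward O(Tnm)-time, O(Tm)-space dynamic program mentioned above can be sketched as follows in Python; the scoring parameters are assumptions, and this is the naive formulation, not the authors' approximation algorithm.

    def length_restricted_local_alignment(x, y, T, match=2, mismatch=-1, gap=-1):
        # Best local alignment score in which the aligned substring of y
        # consists of at most T characters.  O(T*n*m) time, O(T*m) space.
        n, m = len(x), len(y)
        NEG = float('-inf')
        # prev[j][l] / cur[j][l]: best score of an alignment ending at x[i], y[j]
        # whose aligned piece of y uses exactly l characters (l <= T).
        prev = [[NEG] * (T + 1) for _ in range(m + 1)]
        for row in prev:
            row[0] = 0                      # empty alignment: score 0, 0 chars of y
        best = 0
        for i in range(1, n + 1):
            cur = [[NEG] * (T + 1) for _ in range(m + 1)]
            for row in cur:
                row[0] = 0
            for j in range(1, m + 1):
                for l in range(1, T + 1):
                    s = match if x[i - 1] == y[j - 1] else mismatch
                    score = max(prev[j - 1][l - 1] + s,   # align x[i-1] with y[j-1]
                                prev[j][l] + gap,          # gap in y
                                cur[j - 1][l - 1] + gap)   # gap in x, consumes y[j-1]
                    cur[j][l] = score
                    best = max(best, score)
            prev = cur
        return best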
Given strings S1, S2, and P, the constrained longest common subsequence problem for S1 and S2 with respect to P is to find a longest common subsequence lcs of S1 and S2 which contains P as a subsequence. We present an algorithm which improves the time complexity of the problem from the previously known O(rn²m²) to O(rnm), where r, n, and m are the lengths of P, S1, and S2, respectively. As a generalization of this, we extend the definition of the problem so that the lcs sought contains a subsequence whose edit distance from P is less than a given parameter d. For the latter problem, we propose an algorithm whose time complexity is O(drnm).
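The O(rnm) bound comes from a table indexed by prefixes of S1, S2, and P. A hedged Python sketch of that recurrence (rolling the S1 dimension so only O(rm) space is live) is shown below; it illustrates the standard constrained-LCS dynamic program, not necessarily the authors' exact formulation.

    def constrained_lcs_length(s1, s2, p):
        # Length of the longest common subsequence of s1 and s2 that contains p
        # as a subsequence, or None if no such subsequence exists.  O(r*n*m) time.
        n, m, r = len(s1), len(s2), len(p)
        NEG = float('-inf')
        # dp[j][k]: best value for the current prefix of s1 and s2[:j],
        # with p[:k] already embedded in the chosen common subsequence.
        dp = [[NEG] * (r + 1) for _ in range(m + 1)]
        for j in range(m + 1):
            dp[j][0] = 0
        for i in range(1, n + 1):
            new = [[NEG] * (r + 1) for _ in range(m + 1)]
            new[0][0] = 0
            for j in range(1, m + 1):
                for k in range(r + 1):
                    best = max(dp[j][k], new[j - 1][k])            # skip s1[i-1] or s2[j-1]
                    if s1[i - 1] == s2[j - 1]:
                        best = max(best, dp[j - 1][k] + 1)          # match, not used for p
                        if k > 0 and s1[i - 1] == p[k - 1]:
                            best = max(best, dp[j - 1][k - 1] + 1)  # match covers p[k-1]
                    new[j][k] = best
            dp = new
        return dp[m][r] if dp[m][r] != NEG else None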
We present fast algorithms for computing the largest common subtree (LCST) and for computing the optimal alignment when two similar unordered trees are given. For the LCST problem for rooted trees, we present an O(4^K n) time algorithm, where n is the maximum size of the two input trees and the total number of non-common nodes is bounded by K. We extend this algorithm to unrooted trees and obtain an O(K·4^K n) time algorithm. Both of these are subquadratic for two similar trees within K = o(log₄ n) differences. We also show that the alignment problem for rooted, unordered trees of bounded degree can be solved in O(γ^K n) time for a constant γ. In particular, γ ≤ 4.45 for unlabeled binary trees.
Approximate pattern matching has a wide range of applications and, depending on the type of approximation, there exist numerous algorithms for solving it. In this article we focus on texts originating from OCRed documents, whose errors quite often have a particular form and are far from random. We introduce a new variant of the edit distance metric in which, apart from the traditional edit operations, two new operations are supported. The combination operation allows two or more symbols from a string x to be interpreted as a single symbol and then "matched" (or aligned) against a single symbol of a second string y. Its dual is the split operation, where a single symbol from x is broken down into a sequence of two or more other symbols that can then be matched against an equal number of symbols from y. Our algorithm requires O(L) time for preprocessing and O(mnk) time for computing the edit distance, where L is the total length of all the valid combinations/splits, m and n are the lengths of the two strings under comparison, and k is an upper bound on the number of valid splits for any single symbol. The expected running time is O(mn).
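To make the two new operations concrete, here is a hedged Python sketch of a unit-cost edit distance extended with a table of combination/split rules (e.g. {"rn": {"m"}} lets an OCRed "rn" be aligned against a single "m"). The rule table, its symmetric use for splits, and the unit costs are assumptions, and the naive per-cell rule scan shown here does not reproduce the paper's O(L) preprocessing.

    def ocr_edit_distance(x, y, combos, combo_cost=1):
        # Unit-cost edit distance extended with combination/split rules.
        # combos maps a multi-character group (e.g. "rn") to the set of single
        # symbols it may be read as (e.g. {"m"}); the same table is applied in
        # both directions so that splits are handled as the dual of combinations.
        n, m = len(x), len(y)
        INF = float('inf')
        d = [[INF] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                best = min(d[i - 1][j] + 1,                           # delete x[i-1]
                           d[i][j - 1] + 1,                           # insert y[j-1]
                           d[i - 1][j - 1] + (x[i - 1] != y[j - 1]))  # substitute/match
                for group, singles in combos.items():
                    g = len(group)
                    # combination: a group of x symbols aligned against one y symbol
                    if i >= g and x[i - g:i] == group and y[j - 1] in singles:
                        best = min(best, d[i - g][j - 1] + combo_cost)
                    # split: one x symbol aligned against a group of y symbols
                    if j >= g and y[j - g:j] == group and x[i - 1] in singles:
                        best = min(best, d[i - 1][j - g] + combo_cost)
                d[i][j] = best
        return d[n][m]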
In a number of fields, it is necessary to compare a witness string with a distribution. One possibility is to compute the probability of the string under that distribution. Another, giving a more global view, is to compute the expected edit distance between a randomly drawn string and the witness string. This number is often used to measure the performance of a prediction, the goal then being to return the median string, i.e. the string with the smallest expected distance. Measuring this requires computing the distance between a hypothesis and the distribution. This paper proposes two solutions for computing this value when the distribution is defined by a probabilistic finite state automaton. The first is exact but has a cost that can be exponential in the length of the input string, whereas the second is a fully polynomial-time randomized scheme.
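As an illustration of the sampling-based idea, the following Monte Carlo sketch draws strings from a probabilistic finite-state automaton, represented here (as an assumption) by a map from state to a stop probability plus outgoing (probability, symbol, next state) arcs, and averages their Levenshtein distances to the witness. The paper's fully polynomial-time randomized scheme provides accuracy guarantees that this naive estimator does not.

    import random

    def levenshtein(a, b):
        # Standard unit-cost edit distance in O(|a|*|b|) time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def sample_string(pfa, start, max_len=10000):
        # pfa[q] = (stop_prob, [(prob, symbol, next_state), ...]); the stop
        # probability and outgoing arcs of each state are assumed to sum to 1.
        q, out = start, []
        while len(out) < max_len:
            stop_prob, arcs = pfa[q]
            moves = [(None, None)] + [(sym, nxt) for _, sym, nxt in arcs]
            weights = [stop_prob] + [p for p, _, _ in arcs]
            sym, nxt = random.choices(moves, weights=weights)[0]
            if sym is None:                 # the automaton chose to stop
                break
            out.append(sym)
            q = nxt
        return ''.join(out)

    def estimated_expected_distance(pfa, start, witness, samples=10000):
        # Monte Carlo estimate of the expected edit distance to the witness.
        return sum(levenshtein(sample_string(pfa, start), witness)
                   for _ in range(samples)) / samples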
Kim and Park [A dynamic edit distance table, J. Disc. Algo., 2:302–312, 2004] proposed a method (KP) based on a “dynamic edit distance table” that allows one to efficiently maintain unit-cost edit distance information between two strings A of length m and B of length n when the strings can be modified by single-character edits to their left or right ends. This type of computation is useful, e.g., in cyclic string comparison. KP uses linear time, O(m+n), to update the distance representation after each single edit. Recently, Hyyrö et al. [Incremental string comparison, J. Disc. Algo., 34:2–17, 2015] presented an efficient method for maintaining the dynamic edit distance table under general weighted edit distance, running in O(c(m+n)) time per single edit, where c is the maximum weight of the cost function. That work noted that the Θ(mn) space requirement, and not the running time, may be the main bottleneck in using the dynamic edit distance table. In this paper we take the first steps towards reducing the space usage of the dynamic edit distance table by run-length encoding (RLE) A and B. Let M and N be the lengths of the RLE-compressed versions of A and B, respectively. We show how to store the dynamic edit distance table using Θ(mN+Mn) space while maintaining the same time complexity as the previous methods for uncompressed strings.
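The easy half of the problem is instructive: appending a character to the right end of B only requires filling one new column of the classic DP table, O(m) work. The sketch below illustrates that incremental flavour only; it does not implement the difference representation of KP, the weighted extension of Hyyrö et al., or the RLE-compressed storage proposed in this paper.

    class AppendOnlyEditDistance:
        # Maintains the full edit distance table between a fixed string A and a
        # growing string B; each appended character adds one O(m) column.

        def __init__(self, a):
            self.a = a
            self.cols = [list(range(len(a) + 1))]   # column for the empty B

        def append_to_b(self, ch):
            prev = self.cols[-1]
            col = [prev[0] + 1]
            for i, ca in enumerate(self.a, 1):
                col.append(min(prev[i] + 1,          # delete a[i-1]
                               col[i - 1] + 1,       # insert ch
                               prev[i - 1] + (ca != ch)))
            self.cols.append(col)

        def distance(self):
            return self.cols[-1][-1]

For example, constructing AppendOnlyEditDistance("kitten") and appending the characters of "sitting" one by one leaves distance() == 3.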
This paper introduces a new linear systolic algorithm [10] for the sequence alignment problem [18]. The array consists of min(n, m) processors and computes the edit distance and the sequence alignment of two sequences Target and Source in time min(n, m) + 2·max(n, m), where n and m denote the lengths of Target and Source, respectively. These characteristics make it faster and more efficient than previous linear-array algorithms for the alignment search.
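The parallelism such a systolic array exploits is visible in the standard DP matrix: every cell on an anti-diagonal depends only on the two preceding anti-diagonals, so a linear chain of processors can update a whole anti-diagonal per step. The sequential Python sketch below only illustrates that data flow and is not the array algorithm itself.

    def edit_distance_by_antidiagonals(s, t):
        # Sweeps the DP matrix anti-diagonal by anti-diagonal (d = i + j); each
        # cell depends only on the two previous anti-diagonals, so every cell of
        # one anti-diagonal could be updated in parallel by a linear array.
        n, m = len(s), len(t)
        diags = {}
        for d in range(n + m + 1):
            cur = {}
            for i in range(max(0, d - m), min(n, d) + 1):
                j = d - i
                if i == 0:
                    cur[i] = j
                elif j == 0:
                    cur[i] = i
                else:
                    cur[i] = min(diags[d - 1][i - 1] + 1,                      # delete s[i-1]
                                 diags[d - 1][i] + 1,                          # insert t[j-1]
                                 diags[d - 2][i - 1] + (s[i - 1] != t[j - 1]))  # substitute/match
            diags[d] = cur
        return diags[n + m][n]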
Two approaches to measuring the similarity between symbolically notated musical rhythms are compared with each other and with human judgments of perceived similarity. The first is the edit distance, a popular transformation method, applied to the symbolic rhythm sequences. The second approach employs histograms of the inter-onset intervals (IOIs) calculated from the rhythms. Furthermore, two methods for dealing with the histograms are also compared. The first utilizes the Mallows distance, a transformation method akin to the Earth Mover's distance popular in computer vision, and the second extracts a group of standard statistical features, used in music information retrieval, from the IOI histograms. The measures are compared on four contrastive musical rhythm data sets by means of statistical Mantel tests that compute correlation coefficients between the various dissimilarity matrices. The results provide evidence, from the aural domain, that transformation methods such as the edit distance are superior to feature-based methods for predicting human judgments of similarity. The evidence also supports the hypothesis that IOI-histogram-based methods are better than methods based on music-theoretical structural features computed from the rhythms themselves, provided that the rhythms do not share identical IOI histograms.
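As a rough illustration of the histogram side, the sketch below builds a normalized IOI histogram from symbolic onset times (assumed here to be integers on a common grid) and compares two histograms with the one-dimensional Mallows/Earth Mover's distance, which for unit-spaced bins reduces to the L1 distance between cumulative sums; this representation is an assumption, not the paper's exact setup.

    from collections import Counter

    def ioi_histogram(onsets, max_ioi):
        # Normalized histogram of inter-onset intervals; onsets are integer
        # positions on a common grid and max_ioi covers the largest interval.
        iois = [b - a for a, b in zip(onsets, onsets[1:])]
        counts = Counter(iois)
        total = sum(counts.values()) or 1
        return [counts.get(k, 0) / total for k in range(1, max_ioi + 1)]

    def mallows_1d(h1, h2):
        # Mallows / Earth Mover's distance between two normalized 1-D histograms
        # with unit-spaced bins: the L1 distance between their cumulative sums.
        dist, c1, c2 = 0.0, 0.0, 0.0
        for p, q in zip(h1, h2):
            c1 += p
            c2 += q
            dist += abs(c1 - c2)
        return dist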
The need to use modern computer networking capabilities for the effective sharing of information has resulted in data-sharing systems, which store and manage large amounts of data. These data need to be effectively searched and analyzed. More specifically, in the presence of dirty data, a search for specific information by a standard query (e.g., a search for a name that is misspelled or mistyped) does not return all of the needed information, as required in homeland security, criminology, and medical applications, among others. Different techniques, such as Soundex, Phonix, n-grams, and edit distance, have been used to improve the matching rate in these name-matching applications. These techniques have demonstrated varying levels of success, but there is a pressing need for name-matching approaches that provide high accuracy while maintaining low computational complexity. In this paper, such a technique, called ANSWER, is proposed and its characteristics are discussed. Our results demonstrate that ANSWER achieves both high accuracy and high speed and is superior to other techniques for retrieving fuzzy name matches in large databases.
This paper addresses the problem of automatically inducing the normalized form (lemma) of regular and mildly irregular words with no direct supervision, using language-independent algorithms. More specifically, two string distance metrics (the Levenshtein edit distance and the Dice coefficient similarity measure) are employed for the automatic word lemmatization task, combining two alignment models based on string similarity and on the most frequent inflectional suffixes. The performance of the proposed model has been evaluated quantitatively and qualitatively. Experiments were performed for the Modern Greek and English languages, and the results, which are within the state of the art, show that the proposed model is robust across a variety of languages and computationally efficient. The proposed model may be useful as a pre-processing tool for various language engineering and text mining applications such as spell checkers, electronic dictionaries, and morphological analyzers.
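Purely for illustration, the two measures can be combined into a simple scorer over candidate lemmas: Dice similarity over character bigrams rewards shared stem material, while a length-normalized Levenshtein distance penalizes heavy rewriting. The candidate set, the weighting, and the scoring formula below are assumptions of this sketch, not the paper's alignment models.

    def dice_coefficient(a, b):
        # Dice similarity over character bigrams.
        A = {a[i:i + 2] for i in range(len(a) - 1)}
        B = {b[i:i + 2] for i in range(len(b) - 1)}
        if not A and not B:
            return 1.0
        return 2 * len(A & B) / (len(A) + len(B))

    def levenshtein(a, b):
        # Standard unit-cost edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def best_lemma_candidate(word, candidates, w=0.5):
        # Rank hypothetical lemma candidates by a weighted mix of the two metrics.
        def norm_lev(a, b):
            return levenshtein(a, b) / max(len(a), len(b), 1)
        return max(candidates,
                   key=lambda c: w * dice_coefficient(word, c) - (1 - w) * norm_lev(word, c))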
A strong assumption used to derive generalization guarantees in the standard PAC framework is that training (or source) data and test (or target) data are drawn from the same distribution. Because of the presence of possibly outdated data in the training set, or the use of biased collections, this assumption is often violated in real-world applications, leading to different source and target distributions. To circumvent this problem, a research area known as Domain Adaptation (DA) has recently emerged, giving rise to many adaptation algorithms and theoretical results in the form of generalization bounds. This paper deals with self-labeling DA, whose goal is to iteratively incorporate semi-labeled target data into the learning set to progressively adapt the classifier from the source to the target domain. The contribution of this work is three-fold: First, we provide the minimal necessary theoretical conditions for a self-labeling DA algorithm to perform actual domain adaptation. Second, following these theoretical recommendations, we design a new iterative DA algorithm, called GESIDA, able to deal with structured data. This algorithm makes use of the theory of learning with (ε,γ,τ)-good similarity functions introduced by Balcan et al., which does not require a valid kernel to learn well and allows us to induce sparse models. Finally, we apply our algorithm to a structured image classification task and show that self-labeling domain adaptation is an original way to deal with scaling and rotation problems.
Self-similarity of plants has attracted the attention of biologists for at least 50 years, yet its formal treatment is rare, and no measure for quantifying the degree of self-similarity currently exists. We propose a formal definition and measures of self-similarity, tailored to branching plant structures. To evaluate self-similarity, we make use of an algorithm for computing topological distances between branching systems, developed in computer science. The formalism is illustrated using theoretical branching systems, and applied to analyze self-similarity in two sample plant structures: inflorescences of Syringa vulgaris (lilac) and shoots of Oryza sativa (rice).
Given that exact pair-wise graph matching has a high computational cost, different representational schemes and matching methods have been devised to make matching more efficient. Such methods include representing the graphs as tree structures, transforming the structures into strings, and then calculating the edit distance between those strings. However, many coding schemes are complex and computationally expensive. In this paper, we present a novel coding scheme for unlabeled graphs and perform empirical experiments to evaluate its precision and cost for the matching of neighborhood subgraphs in online social networks. We call our method OSG-L (Ordered String Graph-Levenshtein). Some key advantages of the pre-processing phase are its simplicity, compactness, and low execution time. Furthermore, our method is able to match both non-isomorphisms (near matches) and isomorphisms (exact matches), while also taking into account the degrees of the neighbors, which makes it well suited to social network graphs.
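One deliberately simplified way to realize the string-coding idea is sketched below: a node's neighborhood is encoded as an ordered string of neighbor degrees and two neighborhoods are compared by Levenshtein distance, so exact matches score 1.0 and near matches score slightly less. This is a hedged stand-in that only mimics the flavour of OSG-L; the paper's actual coding scheme is not reproduced here.

    def levenshtein(a, b):
        # Standard unit-cost edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def neighborhood_string(adj, v):
        # Encode the neighborhood of v as an ordered string: neighbor degrees,
        # sorted in descending order, each clipped to a single character.
        degrees = sorted((len(adj[u]) for u in adj[v]), reverse=True)
        return ''.join(chr(ord('a') + min(d, 25)) for d in degrees)

    def neighborhood_similarity(adj1, v1, adj2, v2):
        # 1.0 for identical codes (exact match), smaller values for near matches.
        s1, s2 = neighborhood_string(adj1, v1), neighborhood_string(adj2, v2)
        return 1.0 - levenshtein(s1, s2) / max(len(s1), len(s2), 1)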
This paper addresses the importance of chance discovery in computer security. There are various methods to discover chances in computer usage, but conventional approaches in the computer security community have the drawback of only detecting anomalies without interpreting them. This paper focuses on interpreting the type of an anomaly by analyzing state sequences with the Viterbi algorithm and evaluating the distance between the standard model of each anomaly type and the state sequence of a discovered anomaly. Because state sequences are not always extracted consistently, due to environmental factors, the edit distance is used to measure this distance robustly. Experimental results demonstrate the potential of the proposed method in computer security.
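A minimal sketch of the matching step, assuming the state sequences have already been decoded by the Viterbi algorithm and that each anomaly type has a standard template sequence: the observed sequence is simply assigned to the template at the smallest edit distance. Template names and the sequence representation are illustrative assumptions.

    def levenshtein(a, b):
        # Standard unit-cost edit distance; works on strings or lists of state labels.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def classify_anomaly(state_seq, templates):
        # templates: mapping from anomaly-type name to its standard state sequence;
        # the decoded sequence is assigned to the closest template by edit distance.
        return min(templates, key=lambda name: levenshtein(state_seq, templates[name]))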