We compile pattern matching for overlapping patterns in term rewriting systems into a minimal tree matching automaton. The use of a directed acyclic graph that shares all isomorphic subautomata, which are duplicated in the tree automaton, allows us to reduce the space requirements. We design an efficient method to identify such subautomata and to avoid duplicating their construction while generating the dag automaton. We compute bounds on the size of the automata, thereby improving on previously known equivalent bounds for the tree automaton.
In biochemistry, tandem mass spectrometry (MS/MS) is the most common method for peptide and protein identification. One computational method to obtain a peptide sequence from MS/MS data is de novo sequencing, which is becoming increasingly important in this area. However, de novo sequencing can usually determine only partial sequences confidently, while the undetermined parts are represented by "mass gaps". We call such a partially determined sequence a gapped sequence tag. When a gapped sequence tag is searched in a database for protein identification, the determined parts should match the database sequence exactly, while each mass gap should match a substring of amino acids whose masses add up to the value of the mass gap. In such a case, standard string matching algorithms no longer work. In this paper, we present a new efficient algorithm to find the matches of gapped sequence tags in a protein database.
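The matching condition described above can be sketched directly: determined parts are compared literally, while a mass gap is consumed by extending a run of residues until their masses sum exactly to the gap value. A minimal sketch (the integer mass table and the tag encoding as a list of literal strings and gap masses are illustrative choices, not the paper's algorithm):

```python
# Illustrative integer residue masses (real amino-acid masses are
# fractional; integers avoid floating-point comparison issues here).
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "L": 113}

def match_at(seq, i, tag):
    """True if the gapped tag matches seq starting at position i."""
    if not tag:
        return True
    part, rest = tag[0], tag[1:]
    if isinstance(part, str):                  # determined part: exact match
        return seq.startswith(part, i) and match_at(seq, i + len(part), rest)
    total, j = 0, i                            # mass gap: consume residues
    while total < part and j < len(seq):
        total += MASS[seq[j]]
        j += 1
        if total == part and match_at(seq, j, rest):
            return True
    return False

def find_matches(seq, tag):
    """All start positions where the gapped tag matches the sequence."""
    return [i for i in range(len(seq) + 1) if match_at(seq, i, tag)]
```

For example, the tag `["AG", 186, "L"]` matches `"AGSVL"` at position 0, since the gap 186 is exactly the mass of the substring `"SV"` (87 + 99).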
We present fast algorithms for computing the largest common subtree (LCST) and for computing the optimal alignment when two similar unordered trees are given. For the LCST problem on rooted trees, we present an O(4^K n) time algorithm, where n is the maximum size of the two input trees and the total number of non-common nodes is bounded by K. We extend this algorithm to unrooted trees and obtain an O(K 4^K n) time algorithm. Both of these are subquadratic for two similar trees within K = o(log_4 n) differences. We also show that the alignment problem for rooted unordered trees of bounded degree can be solved in O(γ^K n) time for a constant γ; in particular, γ is at most 4.45 for unlabeled binary trees.
The largest common point set problem (LCP) is, given two point sets P and Q in d-dimensional Euclidean space, to find a subset of P with maximum cardinality that is congruent to some subset of Q. We consider a special case of LCP in which the size of the largest common point set is at least (|P| + |Q| − k)/2. We develop efficient algorithms for this special case of LCP and a related problem. In particular, we present an O(k^3 n^1.34 + k n^2 log n) time algorithm for LCP in two dimensions, which is much better for small k than an existing O(n^3.2 log n) time algorithm, where n = max{|P|, |Q|}.
The possibility of applying compressed matching to JPEG-encoded images is investigated, and the problems raised by the scheme are discussed. Some of these problems can be solved by the use of auxiliary data, which yields various time/space trade-offs. Finally, approaches for dealing with extensions such as allowing scaling or rotations are suggested.
We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. Starting from the suffix array, we construct the suffix tree, from which we derive the virtual suffix tree. Later, we remove the intermediate step of suffix tree construction, and build the VST directly from the suffix array. The VST provides the same functionality as the suffix tree, including suffix links, but at a much smaller space requirement. It has the same linear time construction even for large alphabets, Σ, requires O(n) space to store (n is the string length), and allows searching for a pattern of length m to be performed in O(m log |Σ|) time, the same time needed for a suffix tree. Given the VST, we show an algorithm that computes all the suffix links in linear time, independent of Σ. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the enhanced suffix array [1], and the linearized suffix tree [17]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form.
The problem of extracting a basis of irredundant motifs from a sequence is considered. In previous work such bases were built incrementally for all suffixes of the input string s in O(n^3) time, where n is the length of s. Faster, non-incremental algorithms have been based on the landmark approach to string searching due to Fischer and Paterson, and exhibit respective time bounds of O(n^2 log n log |Σ|) and O(|Σ| n^2 log^2 n log log n), with Σ denoting the alphabet. The algorithm by Fischer and Paterson makes crucial use of the FFT, which is impractical for long sequences.
The present paper describes an off-line algorithm for binary strings that takes O(n^2) time. The algorithm does not need to resort to the FFT, and yet its performance is optimal for finite Σ.
The Parikh vector p(s) of a string s over a finite ordered alphabet Σ = {a1, …, aσ} is defined as the vector of multiplicities of the characters, p(s) = (p1, …, pσ), where pi = |{j | sj = ai}|. A Parikh vector q occurs in s if s has a substring t with p(t) = q. The problem of searching for a query q in a text s of length n can be solved simply and worst-case optimally with a sliding window approach in O(n) time. We present two novel algorithms for the case where the text is fixed and many queries arrive over time.
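The sliding-window baseline mentioned above can be sketched as follows: slide a window of length m = ∑i qi across the text, updating character counts incrementally. Each window comparison costs O(σ), so for a constant-sized alphabet the total time is O(n). This is a straightforward sketch of the standard approach, not code from the paper:

```python
from collections import Counter

def parikh_occurrences(s, q):
    """Return all start positions i such that the window s[i:i+m]
    has Parikh vector q, where m is the sum of q's entries."""
    m = sum(q.values())
    if m == 0 or m > len(s):
        return []
    window = Counter(s[:m])
    hits = [0] if window == q else []
    for i in range(1, len(s) - m + 1):
        window[s[i - 1]] -= 1              # character leaving the window
        if window[s[i - 1]] == 0:
            del window[s[i - 1]]           # keep zero counts out of the Counter
        window[s[i + m - 1]] += 1          # character entering the window
        if window == q:
            hits.append(i)
    return hits
```

For example, `parikh_occurrences("abaab", Counter("aab"))` finds every length-3 window with two a's and one b.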
The first algorithm only decides whether a given Parikh vector appears in a binary text. It uses a linear size data structure and decides each query in O(1) time. The preprocessing can be done trivially in Θ(n^2) time.
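A plausible way to achieve O(1) queries on a binary text, sketched here as an assumption about how such a data structure could work: as a window of fixed length slides by one position, its count of ones changes by at most one, so the achievable counts for each window length form an interval, and a query reduces to an interval test against precomputed per-length minima and maxima.

```python
def preprocess(text):
    """For each window length m, record the minimum and maximum number
    of ones over all windows of length m. Since the count changes by at
    most 1 per slide, every value between min and max is realized."""
    n = len(text)
    bits = [1 if c == "1" else 0 for c in text]
    lo, hi = [0] * (n + 1), [0] * (n + 1)
    for m in range(1, n + 1):              # Θ(n^2) total preprocessing
        ones = sum(bits[:m])
        mn = mx = ones
        for i in range(1, n - m + 1):
            ones += bits[i + m - 1] - bits[i - 1]
            mn, mx = min(mn, ones), max(mx, ones)
        lo[m], hi[m] = mn, mx
    return lo, hi

def occurs(lo, hi, zeros, ones):
    """O(1) query: does the Parikh vector (zeros, ones) occur?"""
    m = zeros + ones
    return 1 <= m < len(lo) and lo[m] <= ones <= hi[m]
```

The two arrays use linear space, matching the bounds stated in the abstract.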
The second algorithm finds all occurrences of a given Parikh vector in a text over an arbitrary alphabet of size σ ≥ 2 and has sub-linear expected time complexity. More precisely, we present two variants of the algorithm, both using an O(n) size data structure, each of which can be constructed in O(n) time. The first solution is very simple and easy to implement, and achieves a sub-linear expected query time, where m = ∑i qi is the length of a string with Parikh vector q. The second uses wavelet trees and improves the expected runtime by a factor of log m.
Let S = s_1 s_2 … s_n be a string with symbols from an alphabet. S is said to be degenerate if at some positions i, s_i can contain a subset of symbols from the alphabet rather than just one symbol. Given a text string T and a pattern P, both with symbols from an alphabet Σ, the degenerate string matching problem is to find the positions in T where P occurs, where T, P, or both are allowed to be degenerate. Although some algorithms have been proposed, their huge computational cost poses a significant challenge to their practical utilization. In this work, we propose IDPM, an improved degenerate pattern matching algorithm based on an extension of the Boyer–Moore algorithm. In the preprocessing phase, the algorithm defines an alphabet-independent compatibility rule and computes the shift arrays using respective variants of the bad-character and good-suffix heuristics. In the search phase, IDPM improves the matching speed by using the compatibility rule. On average, the proposed IDPM algorithm has linear time complexity with respect to the text size and to the overall size of the pattern. IDPM demonstrates significant performance improvements over state-of-the-art approaches. It can be used for fast practical degenerate pattern matching with large data sizes, with important applications in flexible and scalable searching of huge biological sequences.
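The compatibility idea behind degenerate matching can be illustrated with sets: each position holds a set of admissible letters (a determined position is a singleton), and two positions are compatible when their sets intersect. The sketch below shows only this rule with a plain quadratic scan; IDPM's bad-character and good-suffix shift tables are omitted:

```python
def compatible(a, b):
    """Two degenerate positions match when their letter sets intersect."""
    return bool(a & b)

def degenerate_find(text, pattern):
    """text and pattern are lists of sets of letters.
    Returns all start positions where every position pair is compatible."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1)
            if all(compatible(text[i + j], pattern[j]) for j in range(m))]

def seq(s):
    """Helper: encode a plain (non-degenerate) string as singleton sets."""
    return [{c} for c in s]
```

For example, the degenerate pattern `[{"c"}, {"a", "g"}]` matches `"acgt"` at position 1, since `{"a", "g"}` is compatible with the `g` at position 2.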
A weighted string is a string in which a set of letters may occur at each position with respective occurrence probabilities. Weighted strings, also known as position weight matrices, weighted sequences or uncertain sequences, naturally arise in many contexts. In this paper, we study the problem of weighted string matching with a special focus on average-case analysis. Given a weighted pattern string x of length m, a text string y of length n > m, both on a constant-sized alphabet of size σ, and a cumulative weight threshold 1/z, defined as the minimal probability of occurrence of factors in a weighted string, we present an on-line algorithm requiring average-case search time o(n) for pattern matching for weight ratio z/m < min{(1/2) log_{z+1} z, log σ/(log z (log m + log log σ))}. For a pattern string x of length m, a weighted text string y of length n > m, both on a constant-sized alphabet, and a cumulative weight threshold 1/z, we present an on-line algorithm requiring average-case search time o(n) for the same weight ratio. The importance of these algorithms lies in the fact that, for these ratios, they work in search time sublinear in the size of the input text, with preprocessing costs linear in the size of the pattern.
An efficient method for fingerprint searching using recurrent autoassociative memory is proposed. The algorithm uses a recurrent autoassociative memory, whose connectivity matrix indicates whether the pattern being searched for is already stored in the database. The advantage of this memory is that the full database needs to be searched only if there is a matching pattern. Fingerprint comparison is usually based on minutiae matching, and its efficiency depends on the extraction of minutiae; this process may reduce speed when a large amount of data is involved. So, in the proposed method, a simple approach has been adopted: first the closely matching fingerprint images are determined, and then the minutiae of only those images are computed to find the most appropriate one. The gray-level values of each pixel and its neighbors are used for the extraction of minutiae, which is easier than using ridge information. This approach is best suited when the database is large.
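A minimal bipolar autoassociative (Hopfield-style) memory illustrates the idea of a connectivity matrix that recognizes stored patterns: patterns are stored via a Hebbian rule, and a probe is "recognized" when it is a fixed point of the recall step. This is a generic textbook construction, not the paper's system:

```python
def store(patterns):
    """Build the connectivity matrix from bipolar (+1/-1) patterns."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]   # Hebbian update, zero diagonal
    return w

def recall(w, x):
    """One synchronous recall step: threshold the weighted sums."""
    return [1 if sum(wij * xj for wij, xj in zip(row, x)) >= 0 else -1
            for row in w]

def is_stored(w, x):
    """A probe near a stored pattern settles onto it; only for such
    candidates would an expensive minutiae comparison then be run."""
    return recall(w, x) == list(x)
```

With few stored patterns relative to their length, stored patterns are stable fixed points while unrelated probes are not.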
This paper presents a unifying review of a learning-based framework for kernel-based attributed graph matching. The framework, which includes as special cases the RKHS Interpolator-Based Graph Matching (RIGM) and Interpolator-Based Kronecker Product Graph Matching (IBKPGM) algorithms, incorporates a general approach where no assumption is made about the adjacency structure of the graphs to be matched. Corresponding pairs of attributed adjacency matrices and attribute vectors of an input and reference graph are used as the input–output training set of a constrained multi-input multi-output multi-variable mapping to be learned. It is shown that a Reproducing Kernel Hilbert Space (RKHS) based interpolator can be used to infer this mapping. Partially constraining the inferred mapping by the generation of additional consistency input–output training pairs and the use of polynomial feature augmentation leads to improved performance. The proposed learning-based framework avoids the explicit calculation of compatibility values.
Multi-string matching (MSM) is a core technique for searching a text string for all occurrences of a set of string patterns. It is widely used in many applications. However, as the number of string patterns increases, most existing algorithms suffer from two issues: long matching time and high memory consumption. To address these issues, in this paper a fast matching engine is proposed for large-scale string matching problems. Our engine includes a filter module and a verification module. The filter module is based on several bitmaps that quickly filter out invalid positions in the text, while for each potentially matching position the verification module confirms a true pattern occurrence. In particular, we design a compact data structure called the Adaptive Matching Tree (AMT) for the verification module, in which each tree node saves only some pattern fragments of the whole pattern set, and the inner structure of each tree node is chosen adaptively according to the features of the corresponding pattern fragments. This makes the engine time- and space-efficient. The experiments indicate that our matching engine performs better than the compared algorithms, especially for large pattern sets.
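A minimal filter-then-verify sketch in the spirit of the engine described above: a bitmap over 2-character pattern prefixes cheaply rules out most text positions, and only surviving positions are verified against the pattern set. A plain dictionary stands in for the paper's Adaptive Matching Tree:

```python
class FilterVerifyMatcher:
    def __init__(self, patterns):
        self.by_prefix = {}
        self.bitmap = bytearray(256 * 256 // 8)   # one bit per 2-byte prefix
        for p in patterns:
            assert len(p) >= 2
            key = (ord(p[0]) & 0xFF) << 8 | (ord(p[1]) & 0xFF)
            self.bitmap[key >> 3] |= 1 << (key & 7)
            self.by_prefix.setdefault(p[:2], []).append(p)

    def search(self, text):
        hits = []
        for i in range(len(text) - 1):
            key = (ord(text[i]) & 0xFF) << 8 | (ord(text[i + 1]) & 0xFF)
            if not self.bitmap[key >> 3] & (1 << (key & 7)):
                continue                          # filtered out, no verification
            for p in self.by_prefix.get(text[i:i + 2], ()):
                if text.startswith(p, i):         # verify full occurrence
                    hits.append((i, p))
        return hits
```

For the pattern set {"he", "she", "hers"}, searching "ushers" reports occurrences at positions 1 and 2; positions whose 2-character window matches no prefix are skipped without any verification work.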
The major problems in parsing English conjunctions and comparatives are ambiguities of scoping and ellipsis. Scoping ambiguities occur when a parser cannot deterministically detect boundaries of constituents, while ellipsis ambiguities occur when a parser cannot deterministically detect missing components. Since simple lookahead mechanisms cannot collect adequate information to resolve these ambiguities, a parsing strategy that only employs such mechanisms will need to backtrack each time it makes incorrect assumptions. In this paper, we extend the Wait-And-See strategy to parse conjunctions and comparatives deterministically and simultaneously. Several mechanisms, such as bottom-up preparsing, suspension, and pattern matching, are implemented. The bottom-up preparsing accesses the dictionary and recognizes isolated sentence fragments which can be determined without ambiguity. The suspension, which is different from Marcus’s attention shifting, allows the parser to suspend temporarily at ambiguous points and continue to parse the rest of the sentence until it obtains the necessary information to resolve the ambiguities. Pattern matching uses the concept of symmetry to detect missing components (the ellipses) in the two conjoined or compared sentence fragments.
The recognition of patterns is an important task in robot and computer vision. The patterns themselves could be one- or two-dimensional, depending upon the application. Pattern matching is a computationally intensive and time consuming operation. The design of special purpose hardware could speed up the matching task considerably, making real-time responses possible. Advances in parallel processing and VLSI technologies have made it possible to implement inexpensive, efficient and very fast custom designs. Many approaches and solutions have been proposed in the literature for hardware implementations of pattern matching techniques. In this paper, we present a detailed overview of some of the important contributions in the area of hardware algorithms and architectures for pattern matching.
This paper reviews some gradient edge detection methods and proposes a new detector, the template matching edge detector (TMED). This detector uses the concepts of pattern analysis and template matching with 3×3 masks. A set of performance criteria was used to evaluate the gradient edge detectors as well as the template matching edge detector. The results indicate that the new method is superior to the gradient edge detectors. In addition, the template matching edge detector also performs well on noisy images and can produce very precise edges of single-pixel width.
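A generic 3×3 template-matching edge detector in the spirit described above: the image is correlated with a set of directional masks and each pixel takes the maximum absolute response. The four compass masks below are standard examples, not necessarily the exact templates of the paper:

```python
TEMPLATES = [
    [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],     # horizontal edge
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],     # vertical edge
    [[0, 1, 1], [-1, 0, 1], [-1, -1, 0]],     # 45-degree edge
    [[1, 1, 0], [1, 0, -1], [0, -1, -1]],     # 135-degree edge
]

def edge_response(img, y, x):
    """Maximum absolute template response at interior pixel (y, x)."""
    best = 0
    for t in TEMPLATES:
        s = sum(t[dy][dx] * img[y - 1 + dy][x - 1 + dx]
                for dy in range(3) for dx in range(3))
        best = max(best, abs(s))
    return best

def detect_edges(img, threshold):
    """All interior pixels whose best template response clears the threshold."""
    h, w = len(img), len(img[0])
    return [(y, x) for y in range(1, h - 1) for x in range(1, w - 1)
            if edge_response(img, y, x) >= threshold]
```

On an image with a vertical intensity step, only the pixels adjacent to the step respond strongly, which is how such a detector localizes single-pixel-wide edges.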
The recognition of polygons in 3-D space is an important task in robot vision. Advances in VLSI technology have now made it possible to implement inexpensive, efficient and very fast custom designs. The authors have earlier proposed a class of VLSI architectures for this computationally intensive task, which makes use of a set of local shape descriptors for polygons which are invariant under affine transformations, i.e. translation, scaling, rotation and orthographic projection from 3-D to any 2-D plane. This paper discusses the design and implementation of PMAC, a prototype for polygon matching, as a custom CMOS VLSI chip. The recognition procedure is based on the matching of edge-length ratios using a simplified version of the dynamic programming procedure commonly employed for string matching. The matching procedure also copes with partial occlusions of polygons. The implemented architecture is systolic and fully utilizes the principles of pipelining and parallelism in order to obtain high speed and throughput.
Given a pattern of length m and a text of length n, commonly m ≪ n, this paper presents a randomized parallel algorithm for pattern matching in O(n^(1/10)) (= O(n^(1/10) + (n−m)^(1/10))) time on a newly proposed n^(3/5) × n^(2/5) modular mesh-connected computer with multiple buses. Furthermore, the time bound of our parallel algorithm can be reduced to O(n^(1/11)) if fewer processors are used.
Affine invariant pattern metrics are useful for shape recognition. It is important that such a metric be robust against various defects. We formalize these types of robustness using four axioms. Then, we present the reflection metric, an affine invariant metric defined for the large family of "boundary patterns". A boundary pattern is a finite union of (n−1)-dimensional algebraic surface patches in ℝ^n. Such a pattern may have multiple connected components. We prove that the reflection metric satisfies the four robustness axioms.
In this paper, a real-time, low-cost, geophone-based Elephant Footstep Vibration Detection and Identification (EFVDI) system is proposed. The system design started with a real-time, low-cost, generalized Footstep Vibration Recording and Analyzing (FVRA) system. A series of field experiments to record elephant footstep vibration (target) signals and other possible interfering ground vibration (noise) sources was conducted using the FVRA system. The system's actual field performance was evaluated in terms of maximum detection range, signal amplitude, detection ratio, signal frequency, signal time span, etc. Variations of the system's performance with several input parameters were also investigated. The recorded signals from target as well as noise sources were analyzed to extract different Signal Parameters (SPs). All SPs are saved in a Ground Vibration Signal Pattern Library (GVSPL), which is then used to frame an accurate indigenous Elephant Identification Algorithm (EIA). The EIA is embedded in the FVRA system to turn it into the specific Elephant Footstep Vibration Detection and Identification (EFVDI) system. The EFVDI system successfully segregated elephant footsteps from other noise vibrations with high accuracy in simulated field experiments. The results from the proposed system will provide important data to the ongoing research on developing the much needed, highly accurate Elephant Early Warning System (EEWS).