Please login to be able to save your searches and receive alerts for new content matching your search criteria.
Although regular expressions do not correspond univocally to regular languages, it is still worthwhile to study their properties and algorithms. For the average case analysis one often relies on the uniform random generation using a specific grammar for regular expressions, that can represent regular languages with more or less redundancy. Generators that are uniform on the set of expressions are not necessarily uniform on the set of regular languages. Nevertheless, it is not straightforward that asymptotic estimates obtained by considering the whole set of regular expressions are different from those obtained using a more refined set that avoids some large class of equivalent expressions. In this paper we study a set of expressions that avoid a given absorbing pattern. It is shown that, although this set is significantly smaller than the standard one, the asymptotic average estimates for the size of the Glushkov automaton for these expressions does not differ from the standard case. Furthermore, for this set the asymptotic density of 𝜀-accepting expressions is also the same as for the set of all standard regular expressions.
The partial derivative automaton (𝒜PD) is an elegant simulation of a regular expression. Although it is, in general, smaller than the position automaton (𝒜POS), the algorithms that build 𝒜PD in quadratic worst-case time first compute 𝒜POS. Asymptotically, and on average for the uniform distribution, the size of 𝒜PD is half the size of 𝒜POS, being both linear on the size of the expression. We address the construction of 𝒜PD efficiently, on average, avoiding the computation of 𝒜POS. The expression and the set of its partial derivatives are represented by a directed acyclic graph with shared common subexpressions. We develop an algorithm for building 𝒜PDs from expressions in strong star normal form of size n that runs, on average, in time that asymptotically behaves as n3/24√log(n), and space as n3/2/(logn)3/4. Empirical results corroborate its good practical performance.
Regular expressions are used in many practical applications. Practical regular expressions are commonly called "regex". It is known that regex are different from regular expressions. In this paper, we give regex a formal treatment. We make a distinction between regex and extended regex; while regex represent regular languages, extended regex represent a family of languages larger than regular languages. We prove a pumping lemma for the languages expressed by extended regex. We show that the languages represented by extended regex are incomparable with context-free languages and a proper subset of context-sensitive languages. Other properties of the languages represented by extended regex are also studied.
In this paper two concurrent versions of Brzozowski's deterministic finite automaton (DFA) construction algorithm are developed from first principles, the one being a slight refinement of the other. We rely on Hoare's CSP as our notation.
The specifications that are proposed of the Brzozowski algorithm are in terms of the concurrent composition of a number of top-level processes, each participating process itself composed of several other concurrent processes. After considering a number of alternatives, this particular overall architectural structure seemed like a natural and elegant mapping from the sequential algorithm's structure.
While we have carefully argued the reasons for constructing the concurrent versions as proposed in the paper, there are of course, a large range of alternative design choices that could be made. There might also be scope for a more fine-grained approach to updating sets or checking for similarity of regular expressions. At this stage, we have chosen to abstract away from these considerations, and leave their exploration for a subsequent step in our research.
Antimirov and Mosses proposed a rewrite system for deciding the equivalence of two (extended) regular expressions. They argued that this method could lead to a better average-case algorithm than those based on the comparison of the equivalent minimal deterministic finite automata. In this paper we present a functional approach to that method, prove its correctness, and give some experimental comparative results. Besides an improved functional version of Antimirov and Mosses's algorithm, we present an alternative one using partial derivatives. Our preliminary results lead to the conclusion that, indeed, these methods are feasible and, most of the time, faster than the classical methods.
In this paper we extend finite-memory automata with non-deterministic reassignment that allows an automaton to "guess" the future content of its registers, and introduce the corresponding notion of a regular expression over an infinite alphabet.
Membership problems for compressed strings in regular languages are investigated. Strings are represented by straight-line programs, i.e., context-free grammars that generate exactly one string. For the representation of regular languages, various formalisms with different degrees of succinctness (e.g., suitably extended regular expressions, hierarchical automata) are considered. Precise complexity bounds are derived. Among other results, it is shown that the compressed membership problem for regular expressions with intersection is PSPACE-complete. This solves an open problem of Plandowski and Rytter.
We summarize results on the complexity of regular(-like) expressions and tour a fragment of the literature. In particular we focus on the descriptional complexity of the conversion of regular expressions to equivalent finite automata and vice versa, to the computational complexity of problems on regular-like expressions such as, e.g., membership, inequivalence, and non-emptiness of complement, and finally on the operation problem measuring the required size for transforming expressions with additional language operations (built-in or not) into equivalent ordinary regular expressions.
The partial derivative automaton () is usually smaller than other nondeterministic finite automata constructed from a regular expression, and it can be seen as a quotient of the Glushkov automaton (
). By estimating the number of regular expressions that have ε as a partial derivative, we compute a lower bound of the average number of mergings of states in
and describe its asymptotic behaviour. This depends on the alphabet size, k, and for growing k's its limit approaches half the number of states in
. The lower bound corresponds to consider the
automaton for the marked version of the regular expression, i.e. where all its letters are made different. Experimental results suggest that the average number of states of this automaton, and of the
automaton for the unmarked regular expression, are very close to each other.
Quantitative aspects of systems like consumption of resources, output of goods, or reliability can be modeled by weighted automata. Recently, objectives like the average cost or the longtime peak power consumption of a system have been modeled by weighted automata which are not semiring weighted anymore. Instead, operations like limit superior, limit average, or discounting are used to determine the behavior of these automata. Here, we introduce a new class of weight structures subsuming a range of these models as well as semirings. Our main result shows that such weighted automata and Kleene-type regular expressions are expressively equivalent both for finite and infinite words.
In this paper, the relation between the Glushkov automaton and the partial derivative automaton
of a given regular expression, in terms of transition complexity, is studied. The average transition complexity of
was proved by Nicaud to be linear in the size of the corresponding expression. This result was obtained using an upper bound of the number of transitions of
. Here we present a new quadratic construction of
that leads to a more elegant and straightforward implementation, and that allows the exact counting of the number of transitions. Based on that, a better estimation of the average size is presented. Asymptotically, and as the alphabet size grows, the number of transitions per state is on average 2. Broda et al. computed an upper bound for the ratio of the number of states of
to the number of states of
which is about ½ for large alphabet sizes. Here we show how to obtain an upper bound for the number of transitions in
, which we then use to get an average case approximation. In conclusion, assymptotically, and for large alphabets, the size of
is half the size of the
. This is corroborated by some experiments, even for small alphabets and small regular expressions.
We investigate the descriptional complexity of nondeterministic biautomata, which are a generalization of biautomata [O. KLÍMA, L. POLÁK: On biautomata. RAIRO — Theor. Inf. Appl., 46(4), 2012]. Simply speaking, biautomata are finite automata reading the input from both sides; although the head movement is nondeterministic, additional requirements enforce biautomata to work deterministically. First we study the size blow-up when determinizing nondeterministic biautomata. Further, we give tight bounds on the number of states for nondeterministic biautomata accepting regular languages relative to the size of ordinary finite automata, regular expressions, and syntactic monoids. It turns out that as in the case of ordinary finite automata nondeterministic biautomata are superior to biautomata with respect to their relative succinctness in representing regular languages.
The equivalence of finite automata and regular expressions dates back to the seminal paper of Kleene on events in nerve nets and finite automata from 1956. In the present paper we tour a fragment of the literature and summarize results on upper and lower bounds on the conversion of finite automata to regular expressions and vice versa. As an interesting special case also one-unambiguous regular expressions, a sort of a deterministic version of a regular expression, are considered. We also briefly recall the known bounds for the removal of spontaneous transitions (ε-transitions) on non-ε-free nondeter-ministic devices. Moreover, we report on recent results on the average case descriptional complexity bounds for the conversion of regular expressions to finite automata and new developments on the state elimination algorithm that converts finite automata to regular expressions.
Regular expressions are often ambiguous. We present a novel method based on Brzozowski’s derivatives to aid the user in diagnosing ambiguous regular expressions. We introduce a derivative-based finite state transducer to generate parse trees and minimal counter-examples. The transducer can be easily customized to either follow the POSIX or Greedy disambiguation policy and based on a finite set of examples it is possible to examine if there are any differences between POSIX and Greedy.
Glushkov automaton is an efficient structure for matching regular expressions. We present an O(k2k) bits space representation of Glushkov automata of regular expressions, where k is the number of strings in the regular expression. The state transition runs in time O(⌈m/w⌉logk), where w is the size of machine words, and m is the number of states in the Glushkov automaton. For k≤4m/(√8w+9−1), the time is O(⌈m/w⌉). Our approach is based on two operations on words that retrieve and set bits on a specific set of positions of a word. We present implementations of these operations using bit-parallelism and a permutation network, which are simple and efficient.
For regular expressions in (strong) star normal form a large set of efficient algorithms is known, from conversions into finite automata to characterisations of unambiguity. In this paper we study the average complexity of this class of expressions using analytic combinatorics. As it is not always feasible to obtain explicit expressions for the generating functions involved, here we show how to get the required information for the asymptotic estimates with an indirect use of the existence of Puiseux expansions at singularities. We study, asymptotically and on average, the alphabetic size, the size of the 𝜀-follow automaton and of the position automaton, as well as the ratio and the size of these expressions to standard regular expressions.
In this paper, we investigate the compensation loops, a DNA rearrangement in chromosomes due to unequal crossing over. We study the effect of compensation loops over the gene duplication, and we formalize it as a restricted case of gene duplication in general. We study this biological process under the point of view of formal languages, and we provide some results about the languages defined in this way.
We are interested in regular expressions and transducers that represent word relations in an alphabet-invariant way — for example, the set of all word pairs u,v where v is a prefix of u independently of what the alphabet is. Current software systems of formal language objects do not have a mechanism to define such objects. We define transducers in which transition labels involve what we call set specifications, some of which are alphabet invariant. In fact, we give a more broad definition of automata-type objects, called labelled graphs, where each transition label can be any string, as long as that string represents a subset of a certain monoid. Then, the behavior of the labelled graph is a subset of that monoid. We do the same for regular expressions. We obtain extensions of a few classic algorithmic constructions on ordinary regular expressions and transducers at the broad level of labelled graphs and in such a way that the computational efficiency of the extended constructions is not sacrificed. For transducers with set specs we obtain further algorithms that can be applied to questions about independent regular languages as well as a decision question about synchronous transducers.
Bulk-Synchronous Parallel (BSP) is a bridging model between abstract execution and concrete parallel architectures that retains enough information to model performance scalability. In order to model BSP program executions, Hains adapted the theory of finite automata to the BSP paradigm by introducing BSP automata [12].
In the initial description of the theory, BSP automata had to be explicitly defined as structured sets of states and transitions. The lack of a clean and efficient algorithm for generating them from regular expressions would have prevented this theory from being used in practice. To alleviate this problem we introduce in this paper an algorithm that generates a BSP automaton recognizing a BSP language defined by a BSP regular expression. The main practical use of BSP automata developed in this paper is the parallelization of finite state automata with an explicit distribution and a performance model, that enable parallel matching of regular expressions. Secondarily, BSP regular expressions provide a convenient structure for automatic code generation of imperative BSP programs that is also developed in this paper.
Segmentation of handwritten text into lines, words and characters is one of the important steps in the handwritten text recognition process. In this paper, we propose a float fill algorithm for segmentation of unconstrained Devanagari text into words. Here, a text image is directly segmented into individual words. Rectangular boundaries are drawn around the words and horizontal lines are detected with template matching. A mask is designed for detecting the horizontal line and is applied to each word from left to right and top to bottom of the document. Header lines are removed for character separation. A new segment code features are extracted for each character. In this paper, we present the results of multiple classifier combination for offline handwritten Devanagari characters. The use of regular expressions in handwritten characters is a novel concept and they are defined in a manner so that they can become more robust to noise.
We have achieved an accuracy of 94% for word level segmentation, 95% for coarse classification and 85% for fine classification of character recognition. On experimentation with a dataset of 5000 samples of characters, the overall recognition rate observed is 95% as we considered top five choice results. The proposed combined classifier can be applied to handwritten character recognition of any other language like English, Chinese, Arabic, etc. and can recognize the characters with same accuracy.18 For printed characters we have achieved accuracy of 100%, only by applying the regular expression classifier.17