Optical character recognition and document image analysis have become very important areas with a fast-growing number of researchers in the field. This comprehensive handbook, with contributions by eminent experts, presents both the theoretical and practical aspects at an introductory level wherever possible.
https://doi.org/10.1142/9789812830968_fmatter
https://doi.org/10.1142/9789812830968_0001
This chapter describes image processing methods for document image analysis. The methods are grouped into four categories, namely, image acquisition, image transformation, image segmentation, and feature extraction. In image acquisition, we describe the process of converting a document into its numerical representation, including image coding as a means to reduce the storage requirement. Image transformation addresses image-to-image operations, which comprise a large spectrum of techniques ranging from geometrical correction, filtering and figure-background separation to boundary detection and thinning. In image segmentation, we describe four popular techniques, namely, connected component labeling, X-Y-tree decomposition, run-length smearing, and Hough transform. Finally, a number of feature extraction methods, which constitute the basis of image classification, are presented.
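Run-length smearing, one of the segmentation techniques named above, is easy to illustrate. The sketch below is not taken from the chapter; it shows a horizontal run-length smearing pass in Python, and the gap threshold `max_gap` is illustrative and would normally be tuned to the scan resolution.

```python
import numpy as np

def rlsa_horizontal(binary, max_gap=20):
    """Horizontal run-length smearing: background runs shorter than max_gap
    that lie between two ink pixels are filled with ink, merging nearby
    components (e.g. characters of a word) into larger blocks."""
    smeared = binary.copy()
    for row in smeared:
        j, n = 0, len(row)
        while j < n:
            if row[j] == 0:
                k = j
                while k < n and row[k] == 0:
                    k += 1
                # fill the gap only if it is short and bounded by ink on both sides
                if j > 0 and k < n and (k - j) <= max_gap:
                    row[j:k] = 1
                j = k
            else:
                j += 1
    return smeared

# toy example: two words on one text line merge into a single block
page = np.zeros((1, 30), dtype=int)
page[0, 2:8] = 1      # first word
page[0, 12:20] = 1    # second word
print(rlsa_horizontal(page, max_gap=6))
```

In the classical run-length smearing algorithm, a vertical pass and a logical AND of the two results usually follow this horizontal pass.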
https://doi.org/10.1142/9789812830968_0002
If pattern generation is viewed as a stochastic process, which usually is appropriate for optical character recognition, speech recognition, and similar tasks, pattern classification can be based on the idea of function approximation. Starting from decision theory, it is shown in this chapter that minimum error decisions are obtained by the classifier if it approximates the a posteriori probabilities ruling the pattern source (Bayesian approach). Since for practical problems the properties of the stochastic pattern source are only implicitly given by the learning set, this set should truly represent the data population underlying the recognition task. Furthermore, the approximating functions should be taken from a family of functions which have — at least in principle — the power of approximating any desirable function (universal approximator). This contribution discusses in some detail the important paradigms for classification based on function approximation: polynomial classifier (including the linear approach) and multilayer perceptron, which are both global approximation schemes; radial basis functions, which are local approximators similar to nearest neighbor techniques; and finally classifiers which are based on certain distribution assumptions. All these paradigms, including different variants, are introduced based on a common notation, and special characteristics of the different approaches are pointed out using the classification of handwritten digits as a practical example.
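As a concrete illustration of the function-approximation view described above, the following sketch (synthetic data, not from the chapter) fits a degree-1 polynomial, i.e. linear, discriminant to one-hot class targets by least squares and then classifies by the maximum of the approximated posteriors, which coincides with the Bayes decision rule when the approximation is good.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic two-class "pattern source": class-conditional Gaussians
X0 = rng.normal(loc=-1.0, scale=1.0, size=(200, 2))
X1 = rng.normal(loc=+1.0, scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
Y = np.zeros((400, 2))            # one-hot class indicator targets
Y[:200, 0] = 1
Y[200:, 1] = 1

# linear (degree-1 polynomial) approximation of the posteriors:
# least-squares regression onto the class indicator vectors
Phi = np.hstack([np.ones((len(X), 1)), X])      # features [1, x1, x2]
A, *_ = np.linalg.lstsq(Phi, Y, rcond=None)     # coefficient matrix

def classify(x):
    d = np.hstack([1.0, x]) @ A   # approximated posteriors (discriminants)
    return int(np.argmax(d))      # decision rule: maximum posterior wins

print(classify(np.array([-2.0, -1.5])), classify(np.array([1.5, 2.0])))
```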
https://doi.org/10.1142/9789812830968_0003
In order to improve recognition results, decisions of several classifiers can be combined. The combination can be accomplished in different ways depending on the types of information produced by the individual classifiers. This chapter considers combination methods that can be applied when the information is provided at both the abstract and measurement levels.
For abstract-level classifiers, the combination methods discussed in this chapter consist of majority vote, weighted majority vote with weights derived from a genetic search algorithm, Bayesian formulation, and Behavior-Knowledge Space method. To combine the decisions of measurement-level classifiers, a multi-layer perceptron is used.
Theoretical considerations and experimental results on handwritten characters are presented, and the results show that combining multiple classifiers is an effective means of producing highly reliable decisions for both categories of classifiers.
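A minimal sketch of the abstract-level combination rules mentioned above: plain and weighted majority voting over the labels produced by several classifiers. The weights here are made-up constants; the chapter derives such weights with a genetic search.

```python
from collections import Counter

def majority_vote(labels, weights=None):
    """Combine abstract-level classifier decisions by (weighted) majority vote.
    labels  : list of class labels, one per classifier
    weights : optional list of per-classifier weights
    """
    weights = weights or [1.0] * len(labels)
    tally = Counter()
    for label, w in zip(labels, weights):
        tally[label] += w
    return tally.most_common(1)[0][0]

# three character classifiers vote on one input image
print(majority_vote(['8', 'B', '8']))                           # -> '8'
print(majority_vote(['8', 'B', 'B'], weights=[0.5, 0.2, 0.2]))  # weighted -> '8'
```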
https://doi.org/10.1142/9789812830968_0004
This chapter describes some experiments with the polynomial classifier approach. First we make some general remarks concerning this approach and explain some useful preprocessing algorithms. In the main part of the chapter, three different approaches are introduced and evaluated with experiments on three different standardized data sets of handprinted digits. First, simple linear classifiers are constructed and tested in order to have a point of reference for the different data sets. With the knowledge gained from these linear classifiers, more sophisticated polynomial structures are constructed to enhance performance. Second, the effects of feature extraction are demonstrated with the Karhunen-Loève transformation, and some results on iterative learning are given. Third, for a fixed classifier approach, different structures of the classifier system are explained and their performance effects are demonstrated on the same data sets. The computational effort required by these different systems in the training and testing stages is explained. At the end, some ideas on how to reduce the computational effort for these systems are given.
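For readers unfamiliar with the polynomial classifier, here is a minimal sketch using toy data rather than the standardized digit sets used in the chapter: a second-degree polynomial feature expansion followed by closed-form least-squares regression onto one-hot class targets, with classification by the largest discriminant value.

```python
import numpy as np

def poly2_features(X):
    """Degree-2 polynomial expansion: [1, x_i, x_i*x_j (i<=j)] per sample."""
    n, d = X.shape
    cols = [np.ones((n, 1)), X]
    for i in range(d):
        for j in range(i, d):
            cols.append((X[:, i] * X[:, j])[:, None])
    return np.hstack(cols)

def train_polynomial_classifier(X, labels, n_classes):
    Y = np.eye(n_classes)[labels]                 # one-hot class targets
    Phi = poly2_features(X)
    A, *_ = np.linalg.lstsq(Phi, Y, rcond=None)   # closed-form least squares
    return A

def classify(A, X):
    return np.argmax(poly2_features(X) @ A, axis=1)

# toy "digit features": two classes that are not linearly separable (XOR-like)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
labels = np.array([0, 1, 1, 0])
A = train_polynomial_classifier(X, labels, n_classes=2)
print(classify(A, X))        # the quadratic term x1*x2 separates the classes
```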
https://doi.org/10.1142/9789812830968_0005
In this chapter, word recognition algorithms based on segmentation of a word image into primitive components (characters and sub-characters) and concurrent concatenation and recognition of these primitive components will be described. Two approaches to word recognition are discussed: i) context-free recognition, where the recognition system yields an optimum letter string and a lexicon is used as a post-processor to select the best match, and ii) lexicon-directed recognition, where a presegmented word image is matched against all the words of the lexicon to obtain the best match. While word recognition may be based on context-free or lexicon-directed techniques, numeral string recognition, such as ZIP Code recognition or courtesy amount recognition on bank checks, is always based on context-free techniques. The recognition of words in a document follows a hierarchical scheme: i) remove tilt (skew) of the document, ii) extract lines of words from the document, iii) remove slant from each line, iv) extract words from each line, v) presegment each word into primitive components (characters and sub-characters), and vi) concatenate components and recognize characters to recognize each word. In this chapter, lexicon-directed word recognition will be discussed in detail; the extension to context-free recognition will be illustrated. Results based on extensive testing with handwritten addresses obtained from the United States Postal Service are presented.
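A minimal sketch of the lexicon-directed idea, under simplifying assumptions: each lexicon letter may consume one or two consecutive primitive segments, and a character scorer `char_score` (hypothetical here) rates how well a group of primitives matches a letter. Dynamic programming then finds the best segmentation and recognition alignment for each lexicon entry.

```python
def match_word(segments, word, char_score, max_merge=2):
    """Dynamic-programming match of primitive segments against one lexicon word.
    segments   : list of primitive segments (characters / sub-characters)
    word       : candidate lexicon entry
    char_score : char_score(segment_group, letter) -> similarity in [0, 1]
    Each letter may consume 1..max_merge consecutive primitives.
    Returns the best total score, or None if no legal alignment exists.
    """
    n, m = len(segments), len(word)
    NEG = float('-inf')
    # best[i][j] = best score using the first i segments for the first j letters
    best = [[NEG] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            for k in range(1, max_merge + 1):        # letter j consumes k primitives
                if i - k >= 0 and best[i - k][j - 1] > NEG:
                    s = best[i - k][j - 1] + char_score(segments[i - k:i], word[j - 1])
                    best[i][j] = max(best[i][j], s)
    return None if best[n][m] == NEG else best[n][m]

# dummy scorer for illustration only: favors segment groups labelled with the letter
def toy_score(group, letter):
    return 1.0 if ''.join(group) == letter else 0.2

print(match_word(['a', 'm'], 'am', toy_score))   # -> 2.0
```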
https://doi.org/10.1142/9789812830968_0006
This paper discusses a number of handwritten word recognition (HWR) systems where the segmentation uncertainty, the character transition information, and the shape ambiguity are well characterized by a stochastic model called the Hidden Markov Model (HMM). Some representative schemes and their typical experimental performance are described. It is concluded that despite the difficulties regarding feature extraction and limited available databases, the HMM could be a dominant technique in the field of HWR.
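For reference, the decoding step common to HMM-based handwritten word recognizers is the Viterbi algorithm. The sketch below uses a toy discrete HMM with made-up parameters; real systems use character or letter models with continuous or vector-quantized observations.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely state path of a discrete HMM (log domain for stability).
    obs     : sequence of observation symbol indices
    start_p : (S,) initial state probabilities
    trans_p : (S, S) state transition probabilities
    emit_p  : (S, V) emission probabilities
    """
    logT, logE = np.log(trans_p), np.log(emit_p)
    delta = np.log(start_p) + logE[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = delta[:, None] + logT              # (from_state, to_state)
        back.append(np.argmax(scores, axis=0))
        delta = np.max(scores, axis=0) + logE[:, o]
    # trace back the best path
    path = [int(np.argmax(delta))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path)), float(np.max(delta))

# toy 2-state model (e.g. "ascender stroke" vs "body stroke") over 3 symbols
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], start, trans, emit))
```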
https://doi.org/10.1142/9789812830968_0007
In this chapter, we analyze several on-line cursive handwriting recognition systems. We find that virtually all such systems involve (a) a preprocessor, (b) a trainable classifier, and (c) a language modeling post-processor. Such architectures are described within the framework of Weighted Finite State Transductions, previously used in speech recognition by Pereira et al. We describe in some detail a recognition system built in our laboratory. It is a writer independent system which can handle a variety of writing styles including cursive script and handprint. The input to the system encodes the pen trajectory as a time-ordered sequence of feature vectors. A Time Delay Neural Network is used to estimate a posteriori probabilities for characters in a word. A Hidden Markov Model segments the word in a way which optimizes the global word score, taking a lexicon into account. The last part of the chapter is devoted to bibliographical notes.
https://doi.org/10.1142/9789812830968_0008
In this chapter, we give an overview of state-of-the-art techniques for improving the recognition results of OCR systems. OCR results may contain segmentation as well as classification errors due to low image quality. Such errors can often be corrected by contextual post-processing. We present the most important techniques for the post-processing of OCR results: voting techniques, lexical post-processing, and techniques that consider the word or document context. Voting techniques combine the recognition results from multiple OCR devices, typically without utilizing any contextual knowledge. Other post-processing techniques are able to correct remaining OCR errors by employing various sources of contextual knowledge. Lexical post-processing, for example, makes use of knowledge about valid words of natural language. More sophisticated techniques integrating knowledge about the word context or even the entire document context can also be applied to further improve the quality of OCR results; the most useful is the incorporation of knowledge about valid word sequences. In general, post-processing of recognition results considerably improves OCR accuracy if various kinds of contextual knowledge beyond the level of individual characters are utilized.
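A minimal sketch of lexical post-processing as described above: each OCR word is replaced by the closest lexicon entry under edit distance, provided the distance stays below a threshold (the threshold here is arbitrary). Real systems would also weight candidate corrections by character confidences and word frequencies.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def lexical_correct(ocr_word, lexicon, max_dist=2):
    """Replace an OCR word by the closest lexicon entry, if it is close enough."""
    best = min(lexicon, key=lambda w: edit_distance(ocr_word, w))
    return best if edit_distance(ocr_word, best) <= max_dist else ocr_word

lexicon = ["recognition", "segmentation", "classifier"]
print(lexical_correct("recoqnition", lexicon))   # -> "recognition"
```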
https://doi.org/10.1142/9789812830968_0009
Most document recognition work to date has been performed on English text. Because of the large overlap of the character sets found in English and major Western European languages such as French and German, some extensions of the basic English capability to those languages have taken place. The only known commercial products for Han-based languages are for Japanese. Systems that can handle a mixture of languages or even single languages without prior identification of the language are unknown. Languages and their scripts have attributes that make it possible to determine the language of a document automatically. Detection of the values of these attributes requires the recognition of particular features of the document image and, in the case of Latin-based languages, the character syntax of the underlying language. We show the process of roughly classifying documents as either Latin-based or Han-based, and within those classes identify the Latin-based language as being one of 23, and the Han-based as being either Chinese, Japanese or Korean.
https://doi.org/10.1142/9789812830968_0010
Chinese ideographs evolved from pictures several thousand years ago and, unlike western alphabets, are still undergoing change. The large size and open-ended nature of the character set, the geometric complexity of some characters, the stroke-oriented structure, and the elusive relationship between symbol, sound, and shape, all have important implications not only for OCR and DIA, but also for other aspects of computer input/output and for text processing. After reviewing the more common character encoding methods, we examine Chinese typographic practices: orientation, reading order, punctuation, spacing, typefaces, and stylistic variants. We survey briefly alternative input and output methods and discuss their relevance to OCR and DIA. Ideographs (Kanji) also form an important part of the Japanese writing system which, however, includes several alphabetic (Kana) components. We proceed with a parallel treatment for Japanese symbols, layout, encoding, and typography, and point out similarities and differences with respect to both Chinese and Western writing systems.
https://doi.org/10.1142/9789812830968_0011
This chapter describes the principles, methods and practices of printed Chinese character recognition. A statistical recognition method based on stroke structural features, together with feature extraction and classifier design, is described for solving the especially difficult problems faced by Chinese OCR systems. In addition, some preprocessing steps (such as binarization, normalization and segmentation) and postprocessing are introduced for improving recognition performance. Lastly, some exciting developments in practical printed Chinese character recognition systems are described.
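The chapter does not prescribe a particular binarization method; as one common choice, the sketch below implements Otsu's global threshold, which separates ink from paper by maximizing the between-class variance of the gray-level histogram.

```python
import numpy as np

def otsu_threshold(gray):
    """Global threshold by Otsu's method: maximize between-class variance."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = hist.sum()
    mu_total = np.dot(np.arange(256), hist) / total
    best_t, best_var = 0, -1.0
    cum_w, cum_mu = 0.0, 0.0
    for t in range(256):
        cum_w += hist[t]
        cum_mu += t * hist[t]
        if cum_w == 0 or cum_w == total:
            continue
        w0 = cum_w / total                                   # weight of dark class
        mu0 = cum_mu / cum_w                                  # mean of dark class
        mu1 = (mu_total * total - cum_mu) / (total - cum_w)   # mean of bright class
        var_between = w0 * (1 - w0) * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

gray = np.concatenate([np.random.randint(0, 60, 500),      # ink pixels
                       np.random.randint(180, 256, 500)])   # paper pixels
t = otsu_threshold(gray)
binary = (gray < t).astype(int)   # pixels darker than the threshold are taken as ink
print(t, binary.sum())
```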
https://doi.org/10.1142/9789812830968_0012
Research interest in Chinese character recognition in Taiwan in recent years has been intense, due in part to cultural considerations, and in part to advances in computer hardware development. This chapter addresses coarse character classification, candidate selection, statistical character recognition, recognition based on structural character primitives such as line segments, strokes and radicals, as well as postprocessing and model development.
Coarse character classification and candidate selection are used to reduce matching complexity; statistical methods of character recognition are shown to be effective, and feature matching with good performance is reported; and structure-based methods able to distinguish between similar characters are investigated thoroughly. Since no temporal information is available for off-line recognition systems, the character test base is still limited. Methods used to extract structural primitives are also investigated.
Language models based on syntactic or semantic considerations are used to select the most probable characters from sets of candidates, and are applied in the postprocessing of input sentence images. These models generally employ dynamic programming methods. To increase identification capacity, various ways of grouping Chinese words into a reasonable number of classes are also proposed.
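A minimal sketch of such language-model postprocessing, with made-up unigram and bigram probabilities: a Viterbi-style dynamic program picks one character from each OCR candidate set so that the bigram score of the whole string is maximal.

```python
import math

def select_candidates(candidate_sets, unigram, bigram):
    """Choose one character per position from OCR candidate sets so that the
    bigram language-model score of the whole string is maximal.
    candidate_sets : list of lists of candidate characters, one list per position
    unigram, bigram: dicts mapping chars / char pairs to probabilities
    """
    def u(c):     return math.log(unigram.get(c, 1e-6))
    def b(p, c):  return math.log(bigram.get(p + c, 1e-6))

    # best[c] = (score of the best path ending in c, that path)
    best = {c: (u(c), [c]) for c in candidate_sets[0]}
    for cands in candidate_sets[1:]:
        new_best = {}
        for c in cands:
            prev, (score, path) = max(best.items(),
                                      key=lambda kv: kv[1][0] + b(kv[0], c))
            new_best[c] = (score + b(prev, c), path + [c])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]

# toy candidate lattice for a two-character word; probabilities are illustrative
cands = [['木', '本'], ['材', '村']]
unigram = {'木': 0.02, '本': 0.03, '材': 0.01, '村': 0.01}
bigram = {'木材': 0.005, '本材': 0.0001, '木村': 0.004, '本村': 0.0002}
print(''.join(select_candidates(cands, unigram, bigram)))   # -> '木材'
```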
https://doi.org/10.1142/9789812830968_0013
Recognition of Japanese machine-printed documents poses several challenges. The variation of document layout styles, vertical and horizontal text alignment, mixed-pitch characters, and the large character set size (over 3000 characters in everyday use) all contribute to the complexity of Japanese OCR system design. Variations in font styles and the structurally complex character set are further contributing factors. After a brief overview of previous Japanese OCR research, we outline the directions of research in Japanese OCR. To illustrate these issues we present details of the design and performance of a Japanese OCR system developed at CEDAR.
https://doi.org/10.1142/9789812830968_0014
This paper describes an HMM network-based approach to online recognition of Korean Hangul characters. The primary concern of the approach is hidden Markov modeling of letters and explicit modeling of inter-letter patterns or ligatures appearing in cursive script. By using ligature HMMs the variability of inter-letter patterns is resolved. By alternate concatenation of the two kinds of HMMs, a network model for all legal Korean characters has been designed. Given the network, the recognition problem is formulated as that of finding the most likely path from the start to the end node. A DP-based search for optimal input-network alignment gives simultaneous letter segmentation and character recognition — up to 93.3% correct on unconstrained samples.
https://doi.org/10.1142/9789812830968_0015
Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little work has been conducted on the automatic recognition of Arabic characters because of the complexity of printed and handwritten text, and this problem is still an open research field. The main objective of this chapter is to present the state of Arabic character recognition research throughout the last 15 years for both on-line and off-line recognition.
https://doi.org/10.1142/9789812830968_0016
Document understanding has attained a level of maturity that requires migration from ad-hoc experimental systems, each of which employs its own set of assumptions and terms, into a solid, standard frame of reference, with generic definitions that are agreed upon by the document understanding community.
The logical structure of a document conveys semantic information that goes beyond the document's character string contents. To capture this additional semantic information, document understanding must relate the document's physical layout to its logical structure. This work provides a formal definition of the logical structure of text-intensive documents. A generic framework using a hierarchy of textons is described for the interpretation of any text-intensive document's logical structure. The recursive definition of textons provides a powerful and flexible tool that is not restricted by the size or complexity of the document. Frames are analogously used as recursive constructs for the physical structure description.
To facilitate the reverse engineering process which is required to derive the logical structure, we describe DAFS, a Document Attribute Format Specification, and demonstrate how our framework can serve as a conceptual framework for enhancements of DAFS.
https://doi.org/10.1142/9789812830968_0017
Technical documents are documents of some science or engineering domain, in which graphics provide the basis for the information conveyed by the document. Engineering drawings in general and mechanical engineering drawings in particular are a typical example. The interpretation of such drawings entails converting them into some kind of CAD representation, which has a high-level meaning, including 3D structure. In this chapter, we review the different methods and techniques used to achieve this goal, which is still an active research topic. After presenting the general framework in which such a system must work, we introduce the three levels of interpretation: lexical, syntactic and semantic. We give an overview of the low- and intermediate-level tools used in this context: vectorization, recognition of arcs, hatching and dashed lines, etc. At the syntactic level, we describe the task of recognizing annotations in general and dimensioning in particular. We also elaborate on the functional analysis that is possible on each view. Finally, we describe the available techniques and the open problems for 3D reconstruction from several views.
https://doi.org/10.1142/9789812830968_0018
Automatic analysis of images of printed forms is a problem of both practical and theoretical interest, due to its importance in office automation, and due to the conceptual challenges posed for document image analysis. The automatic reading of optically scanned forms consists of two major components. The first is the extraction of the data image from the form; the second is the interpretation of the image as coded alphanumerics, and is commonly referred to as optical character recognition (OCR). The individual steps involved in forms analysis include image pre-processing, forms identification, field extraction, data interpretation, and contextual post-processing. In this chapter we outline the issues and current state-of-the-art in the analysis of printed forms. Forms analysis poses several challenges, given the enormous variety of current form layouts and contents, and some of these research issues are explored. In particular, two current forms analysis systems developed at CEDAR are described.
https://doi.org/10.1142/9789812830968_0019
Feature extraction and symbol/character recognition from a paper-based map is discussed. The objective is a conversion from a scanned image to a computer internal representation using an appropriate data structure; roughly, we describe a raster-to-vector conversion. A method using the MAP concept (Multiple-Angled (directional) feature planes with Parallel operation and matching) is proposed. The MAP Operation Method extracts geometric features using directional erosion-dilation operations, and the MAP Matching Method recognizes fixed-shaped symbols by overall directional template matching. In both of these methods, feature representation on multiple directional planes and parallel operations are consistently used. During this raster-to-vector conversion, the following points are stressed: raster rather than vector representation, parallelism and directionality, independent extraction of each feature, and simultaneous segmentation and recognition. Through this example, problems of conventional methods, limits of the proposed approach, and future problems are discussed.
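The MAP method itself operates on several directional planes in parallel; the sketch below is a simplification, not the authors' implementation, and shows a single direction: morphological opening with a horizontal line structuring element, i.e. directional erosion followed by directional dilation, which keeps long horizontal strokes and removes short ones. Other directions can be obtained by transposing or rotating the image; the element length is illustrative.

```python
import numpy as np

def shift_right(img, k):
    """Shift a binary image k pixels to the right, filling with background (0)."""
    out = np.zeros_like(img)
    out[:, k:] = img[:, :img.shape[1] - k] if k else img
    return out if k else img.copy()

def shift_left(img, k):
    out = np.zeros_like(img)
    out[:, :img.shape[1] - k] = img[:, k:] if k else img
    return out if k else img.copy()

def open_horizontal(img, length):
    """Directional opening with a 1 x length horizontal structuring element:
    erosion keeps a pixel only if it begins a run of `length` ink pixels;
    dilation grows the surviving positions back into full-length runs."""
    eroded = img.copy()
    for k in range(1, length):
        eroded &= shift_left(img, k)
    dilated = eroded.copy()
    for k in range(1, length):
        dilated |= shift_right(eroded, k)
    return dilated

# a horizontal stroke (length 8) survives; an isolated 2-pixel blob is removed
img = np.zeros((5, 12), dtype=int)
img[2, 1:9] = 1
img[0, 4:6] = 1
print(open_horizontal(img, length=4))
```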
https://doi.org/10.1142/9789812830968_0020
This chapter describes aspects of methods for map interpretation. Maps are made according to rules, but this only slightly simplifies the design of interpretation systems, because the same map can be drawn by different people with different styles. Also, updates to the same map can be months apart, causing aging effects. Sometimes the rules are followed closely, sometimes less so. A robust map interpretation system has to take all of this into account.
In this chapter two systems for cadastral map interpretation are compared: a bottom-up system and a model-based system. The model-based system uses expectations of reality to correct and improve the interpretation. Apart from better performance, a number of problems faced in the bottom-up system can be solved more easily in a model-based system. Also, parts of the model-based system can be re-used in an interpretation system for other types of maps.
https://doi.org/10.1142/9789812830968_0021
Recognition of mathematical notation involves two main components: symbol recognition and symbol-arrangement analysis. Symbol-arrangement analysis is particularly difficult for mathematics, due to the subtle use of space in this notation. We begin with a general discussion of the mathematics-recognition problem. This is followed by a review of existing approaches to mathematics recognition, including syntactic methods, projection-profile cutting, graph rewriting, and procedurally-coded math syntax. A central problem in all recognition approaches is to find a convenient, expressive, and effective method for representing the notational conventions of mathematics.
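Projection-profile cutting, one of the reviewed approaches, is straightforward to sketch: project the ink onto one axis, cut at the empty runs of the profile, and recurse in the other direction. The toy example below performs one horizontal pass followed by one vertical pass per band; full recursion and gap thresholds are omitted for brevity.

```python
import numpy as np

def profile_cuts(img, axis):
    """Cut a binary image at empty rows (axis=1) or empty columns (axis=0) of the
    projection profile; returns the index ranges of the non-empty bands."""
    profile = img.sum(axis=axis)
    bands, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            bands.append((start, i))
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands

# toy expression image: cut into lines first, then each line into symbols
img = np.zeros((6, 10), dtype=int)
img[1, 1:3] = 1          # symbol 1 on line 1
img[1, 5:8] = 1          # symbol 2 on line 1
img[4, 2:6] = 1          # symbol on line 2
for top, bottom in profile_cuts(img, axis=1):       # horizontal cuts
    line = img[top:bottom, :]
    cols = profile_cuts(line, axis=0)               # vertical cuts within the band
    print(f"rows {top}:{bottom} -> symbol column ranges {cols}")
```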
https://doi.org/10.1142/9789812830968_0022
The aim of Optical Music Recognition (OMR) is to convert optically scanned pages of music into a machine-readable format. In this tutorial-level discussion of the topic, an historical background of work is presented, followed by a detailed explanation of the four key stages of an OMR system: stave line identification, musical object location, symbol identification, and musical understanding. The chapter also shows how recent work has addressed the issues of touching and fragmented objects, problems that must be solved in a practical OMR system. The chapter concludes by discussing remaining problems, including measuring accuracy.
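Stave line identification, the first of the four stages, is often begun with a horizontal projection of the page. The sketch below is illustrative only; the `min_fill` density threshold is an assumption. It marks rows whose ink density across the page width is high and merges adjacent rows into single line positions.

```python
import numpy as np

def find_stave_lines(binary, min_fill=0.8):
    """Locate candidate stave lines as image rows whose ink density exceeds
    min_fill (fraction of black pixels across the page width)."""
    h, w = binary.shape
    row_density = binary.sum(axis=1) / w
    rows = np.flatnonzero(row_density >= min_fill)
    # merge adjacent rows into one line (stave lines are a few pixels thick)
    lines, current = [], [rows[0]] if len(rows) else None
    for r in rows[1:]:
        if r == current[-1] + 1:
            current.append(r)
        else:
            lines.append(int(np.mean(current)))
            current = [r]
    if current:
        lines.append(int(np.mean(current)))
    return lines

# toy page: five 1-pixel stave lines plus a short note-stem fragment that is ignored
page = np.zeros((40, 100), dtype=int)
for y in (10, 14, 18, 22, 26):
    page[y, :] = 1
page[11:13, 50] = 1
print(find_stave_lines(page))   # -> [10, 14, 18, 22, 26]
```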
https://doi.org/10.1142/9789812830968_0023
In this paper an introduction to automatic signature verification is presented. The concepts of dynamic and static signature verification are addressed and the fundamental phases in the process of signature verification are discussed. Throughout the paper, the activities of the main research teams working in the field are presented and steps towards the development of advanced systems are described.
https://doi.org/10.1142/9789812830968_0024
This chapter discusses recent developments in bank check analysis and recognition by computers. A technique for locating the courtesy amount block on bank checks and a deterministic finite automaton for the segmentation and recognition of the courtesy amount will be presented. In the analysis and recognition process, connected components in the image are identified first. Then, strings are constructed on the basis of proximity and horizontal alignment of characters. Next, a set of rules and heuristics is applied to these strings to choose the correct one. The chosen string is only accepted if it passes a verification test, which includes an attempt to recognize the currency sign. A deterministic finite automaton system is then used for segmenting the handprinted courtesy amount. Finally, the separated components are passed on to a neural network based recognition system.
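As an illustration of the automaton idea, the toy deterministic finite automaton below accepts simple courtesy-amount strings of the form `$` followed by digits and, optionally, a point and two digits. A production system's automaton handles many more notations (commas, dashes, fractional amounts written as superscripts, and so on); the state names here are invented for the example.

```python
# states and alphabet are simplified relative to a production system
ACCEPTING = {"INT", "CENTS2"}
TRANSITIONS = {
    ("START", "$"): "DOLLAR",
    ("DOLLAR", "digit"): "INT",
    ("INT", "digit"): "INT",
    ("INT", "."): "POINT",
    ("POINT", "digit"): "CENTS1",
    ("CENTS1", "digit"): "CENTS2",
}

def kind(ch):
    """Map a recognized character to its input class for the automaton."""
    return "digit" if ch.isdigit() else ch

def accepts(string):
    state = "START"
    for ch in string:
        state = TRANSITIONS.get((state, kind(ch)))
        if state is None:          # no legal transition: reject
            return False
    return state in ACCEPTING

for s in ("$125.00", "$7", "$12.5", "125.00"):
    print(s, accepts(s))
```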
https://doi.org/10.1142/9789812830968_0025
Extracting information from paper documents opens up a variety of innovative applications by supporting people in their daily processing of documents. In this chapter, a system that interprets the text on paper documents within the restricted domain of a particular application is presented. The system consists of four components. The Document Image Analysis component transforms the text of the scanned document image into an electronic format represented by a sequence of word hypotheses. Based on this sequence, three components extract the information necessary for the automatic processing of documents. First, the information enclosed in structured text is extracted, such as the sender and recipient of business letters, or the title and authors of scientific papers. Second, the text body of a message is mapped to a pre-defined category. In the final step, this text is analyzed and the information relevant for the current application is extracted. It is shown that for a real-world application the paper documents can be completely interpreted, resulting in an automatically generated answering letter. The system is fast, fault tolerant with respect to misspellings or recognition errors, and readily adaptable to new applications.
https://doi.org/10.1142/9789812830968_0026
A system for the off-line recognition and execution of correction instructions on text documents is described. It is assumed that a corrector manually marks corrections on a paper document using a color pen. Then the document is scanned. Our system automatically detects the correction marks, infers their meaning, and updates the underlying text file according to the desired corrections. A prototype of the system has been implemented and successfully tested on a number of documents.
https://doi.org/10.1142/9789812830968_0027
Braille is the most widely used system for written communication using tactile means. For visually handicapped people, Braille documents have proved to be an invaluable source of information, essential for education and for leisure. Although more Braille documents are produced today, there are still, for instance, very few copies of the small number of books available. Automation of the Braille reading process will effectively enable the transcription and duplication of existing documents as well as their preservation. By converting a Braille document into electronic form, all the benefits of developments for the electronic form of printed documents can be enjoyed: easier transmission, efficient storage and manipulation, etc. The possibility of sighted people also using an automatic Braille reading system to communicate in writing with blind people is a further motivation. In this chapter, the aspects of the automatic reading problem are analysed and solutions to each of them are examined. Firstly, the characteristics and production of Braille and Braille documents are described and the issues involved in the problem are discussed. Secondly, each of the stages towards the automatic reading of Braille documents is analysed and the approaches that have so far been designed to carry it out are critically presented. Finally, the most important factors affecting the success of the whole process are discussed.
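Once the dot positions of each Braille cell have been detected by the image-analysis stages discussed in the chapter, transcription reduces largely to a table lookup. The sketch below shows the idea with a small excerpt of the Grade-1 letter table (dots numbered 1 to 6, column-wise); a complete system handles the full code, contractions, and formatting marks.

```python
# A Braille cell has six dots in two columns of three, numbered column-wise:
#   1 4
#   2 5
#   3 6
# Small excerpt of the Grade-1 letter table for illustration.
BRAILLE = {
    frozenset({1}): 'a',
    frozenset({1, 2}): 'b',
    frozenset({1, 4}): 'c',
    frozenset({1, 4, 5}): 'd',
    frozenset({1, 5}): 'e',
}

def decode_cells(cells):
    """Map a sequence of detected dot-position sets to characters ('?' if unknown)."""
    return ''.join(BRAILLE.get(frozenset(c), '?') for c in cells)

# dot sets detected by the image-analysis front end for three consecutive cells
print(decode_cells([{1, 2}, {1}, {1, 4}]))   # -> 'bac'
```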
https://doi.org/10.1142/9789812830968_0028
The rapid growth of the World-Wide Web (WWW) offers document image analysis and optical character recognition researchers a golden opportunity to bring ideas to bear on an important and timely application. To focus our discussion, we examine the needs of the Internet community for archival technical material. Some of these needs can be satisfied by posting articles now on paper in coded text form, others by posting them in facsimile image form. Most can be satisfied by a combination of both forms, but dual representation imposes certain additional requirements on document analysis. Potential sources of archival electronic documents include existing CD-ROM databases, publishers, libraries, and individual members of the Internet community. However, the cost of manual conversion of a significant portion of the pre-1980 technical material is staggering. This material will remain unavailable on the network unless certain specific optical character recognition (OCR) and document image analysis (DIA) tasks, most of which are the subject of current research, can be automated.
https://doi.org/10.1142/9789812830968_0029
This chapter undertakes an in-depth analysis of the interaction between OCR and information retrieval. In particular, after providing an introduction to information retrieval, we report on retrieval effectiveness from OCR-generated document collections. It will be shown that, in general, average precision and recall are not affected, while document term assignment, weighting and document ranking may be affected. We also point out that even though OCR errors do not affect average retrieval effectiveness, there are other consequences that should be considered when OCR text is used.
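For readers new to information retrieval, the sketch below shows the kind of term weighting and ranking the chapter refers to: tf-idf weights computed over a toy collection of (possibly OCR-generated) documents and a dot-product score used to rank them against a query. The toy documents and query are invented for illustration.

```python
import math
from collections import Counter

def tfidf_rank(query, documents):
    """Rank documents (lists of terms, e.g. OCR output words) against a query
    using tf-idf term weights and a simple dot-product score."""
    N = len(documents)
    df = Counter()                       # document frequency of each term
    for doc in documents:
        df.update(set(doc))
    idf = {t: math.log(N / df[t]) for t in df}

    def score(doc):
        tf = Counter(doc)                # term frequency within the document
        return sum(tf[t] * idf.get(t, 0.0) for t in query)

    return sorted(range(N), key=lambda i: score(documents[i]), reverse=True)

docs = [
    "optical character recognition of scanned pages".split(),
    "information retrieval from noisy text".split(),
    "character recognition and retrieval of document images".split(),
]
print(tfidf_rank(["character", "retrieval"], docs))   # best-matching documents first
```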
https://doi.org/10.1142/9789812830968_0030
Several significant sets of labeled samples of image data are surveyed that can be used in the development of algorithms for offline and online handwriting recognition as well as for machine printed text recognition. The method used to gather each data set, the numbers of samples they contain, and the associated truth data are discussed. In the domain of offline handwriting, the CEDAR, NIST, and CENPARMI data sets are presented. These contain primarily isolated digits and alphabetic characters. The UNIPEN data set of online handwriting was collected from a number of independent sources and it contains individual characters as well as handwritten phrases. The University of Washington document image databases are also discussed. They contain a large number of English and Japanese document images that were selected from a range of publications.
https://doi.org/10.1142/9789812830968_0031
Benchmarking document image analysis (DIA) systems is important for users, for researchers, and for developers of these technologies. From the user's point of view, the appropriate measure of a DIA system is the total cost of document conversion. This cost is typically dominated by the cost of correcting residual errors in the output.
The cost measure needed by a user in selecting a system, however, is not necessarily appropriate for a researcher who seeks to measure progress in developing a new algorithm. Furthermore, the task of comparing the performance of two algorithms, whose internal operation is visible, is quite different from that of comparing several large systems whose source code is a carefully guarded secret. The latter is referred to as “black-box” testing.
All DIA systems are composed of different sub-algorithms often applied in sequence. Measuring the performance of such systems when neither the algorithms nor the sequence is known is an especially difficult task.
In this chapter, we present the important issues in benchmark testing from the “black-box” point of view by reviewing a selection of performance metrics for systems that recognize text from images of machine-printed pages. We present the functional steps for automated evaluation of OCR system performance. We also discuss the relevance of these steps to benchmarking other DIA technologies.
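One widely used page-level metric of the kind reviewed in this chapter is character accuracy derived from the edit distance between the ground-truth text and the OCR output. A minimal sketch (the example strings are invented):

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions turning hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (rc != hc)))
        prev = cur
    return prev[-1]

def character_accuracy(ground_truth, ocr_output):
    """Page-level metric: 1 - (edit errors / number of ground-truth characters)."""
    errors = levenshtein(ground_truth, ocr_output)
    return 1.0 - errors / max(len(ground_truth), 1)

truth = "Benchmarking document image analysis systems"
ocr   = "Benchrnarking docurnent image analysis systens"
print(f"{character_accuracy(truth, ocr):.3f}")
```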
https://doi.org/10.1142/9789812830968_bmatter