DOCUMENT UNDERSTANDING BASED ON MAXIMUM A POSTERIORI PROBABILITY ESTIMATION
This paper presents a new method for extracting document structure. The method is especially suited to extracting character lines from unformatted document images. Because character line extraction strongly influences all subsequent processes, extracting correct lines is one of the most important issues in document image understanding. Maximum a posteriori probability estimation is used to solve this problem: knowledge of the target documents is modeled, and the document structure most suitable for each input image is extracted. Under some assumptions, the model can be expressed very simply. Moreover, the model is obtained automatically by learning from sample target images. This capability enables the proposed system to handle documents of diverse styles in detail without parameter tuning by a human operator.
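
For orientation, the general form of maximum a posteriori estimation can be written as follows. This is only the textbook formulation, not the specific model of this paper: the symbols $I$ (the observed document image) and $S$ (a candidate document structure, such as a set of character lines) are notation assumed here for illustration.

```latex
\hat{S} \;=\; \arg\max_{S} P(S \mid I)
        \;=\; \arg\max_{S} \frac{P(I \mid S)\, P(S)}{P(I)}
        \;=\; \arg\max_{S} P(I \mid S)\, P(S)
```

The last step holds because $P(I)$ does not depend on $S$. In this reading, the prior $P(S)$ would carry the knowledge of the target documents, and it is the kind of quantity that could be learned from sample images rather than set by hand.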