CHINESE WORD SEARCHING IN IMAGED DOCUMENTS
Abstract
An approach to searching for user-specified words in imaged Chinese documents, without the requirements of layout analysis and OCR processing of the entire documents, is proposed in this paper. A small number of Chinese characters that cannot be successfully bounded using connected component analysis due to larger gaps between elements within the characters are blacklisted. A suitable character that is not included in the blacklist is chosen from the user-specified word as the initial character to search for a matching candidate in the document. Once a matched candidate is found, the adjacent characters in the horizontal and vertical directions are examined for matching with other corresponding characters in the user-specified word, subject to the constraints of alignment (either horizontal or vertical direction) and size similarity. A weighted Hausdorff distance is proposed for the character matching. Experimental results show that the present method can effectively search the user-specified Chinese words from the document images with the format of either horizontal or vertical text lines, or both appearing on the same image.