A Query Engine for Retrieving Information from Chinese HTML Documents
Abstract
The amount of online information in Chinese and the number of Chinese Internet users have both increased tremendously during the past decade. Because the Chinese language differs significantly from English, techniques developed for retrieving information from English Web documents cannot be applied directly to Chinese Web documents. To provide high-performance access to Chinese information on the Web, we have developed a Chinese Web query engine that (i) extracts (hierarchical) data of interest from Chinese HTML tables using an information extraction tool called semantic hierarchy, (ii) allows the user to submit queries in Chinese through a menu-driven user interface, and (iii) processes the user's queries (as Boolean expressions) to generate the correct results. Our query engine supports information categorized into a range of subject areas, such as car ads, house rentals, job ads, stocks, and university catalogs. We have tested our information extraction tool on two application domains, car ads and house rentals. The average F-measure for extracting Chinese data in these two domains is above 90%. More importantly, on PCs running Windows operating systems, our query engine can easily be configured and internationalized into a worldwide, multilingual query engine with only minor changes to system settings.
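As an illustration of step (iii), the following minimal sketch shows how a Boolean expression might be evaluated against records that have already been extracted from an HTML table. It is not the engine's actual implementation: the record layout, the field names (品牌 "make", 颜色 "color", 年份 "year"), the sample data, and the helper functions `eq`, `and_`, `or_`, `not_`, and `run_query` are all assumptions made for this example.

```python
from typing import Callable, Dict, List

# Hypothetical record format: one dict per extracted table row.
Record = Dict[str, str]
Predicate = Callable[[Record], bool]


def eq(field: str, value: str) -> Predicate:
    """Atomic condition: the record's field equals the given value."""
    return lambda rec: rec.get(field) == value


def and_(*preds: Predicate) -> Predicate:
    """Boolean AND over any number of conditions."""
    return lambda rec: all(p(rec) for p in preds)


def or_(*preds: Predicate) -> Predicate:
    """Boolean OR over any number of conditions."""
    return lambda rec: any(p(rec) for p in preds)


def not_(pred: Predicate) -> Predicate:
    """Boolean NOT of a single condition."""
    return lambda rec: not pred(rec)


def run_query(records: List[Record], query: Predicate) -> List[Record]:
    """Return the records that satisfy the Boolean expression."""
    return [rec for rec in records if query(rec)]


# Made-up car-ad records (illustrative data only).
records: List[Record] = [
    {"品牌": "丰田", "颜色": "红色", "年份": "2001"},
    {"品牌": "本田", "颜色": "蓝色", "年份": "1999"},
    {"品牌": "丰田", "颜色": "蓝色", "年份": "1998"},
]

# Query: make is 丰田 (Toyota) AND NOT color is 红色 (red).
query = and_(eq("品牌", "丰田"), not_(eq("颜色", "红色")))
results = run_query(records, query)
```

Representing each condition as a composable predicate keeps the menu-driven interface decoupled from evaluation: each menu selection would map to one `eq` condition, and the connectives chosen by the user would determine how the conditions are combined.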