World Scientific

Learning mDTD Extraction Patterns for Semi-structured Web Information Extraction

    https://doi.org/10.1142/S0219427902000492
    Cited by: 1 (Source: Crossref)

    This paper presents a new extraction pattern, called a modified Document Type Definition (mDTD), which relies on an analytical interpretation method to identify textual fragments of documents from the Web. The mDTD departs from the conventional DTD in two major ways. In syntax, we introduce an extended content model with more operators and keywords. In semantics, we change how the mDTD rules are interpreted. The mDTD can represent both HTML structures and extraction targets. Its design goal is to overcome major barriers to information extraction, such as achieving domain portability with minimum human intervention, while maintaining high extraction performance. The user composes mDTD rules as seed rules, with which our system extracts instances from structured documents on the Web. These extracted instances are used as inputs to SmL (Sequential mDTD Learner). SmL generates new mDTD rules based on part-of-speech (POS) tag features and lexical similarity features. Learning does not require a hand-tagged corpus. We experimented with 200 Web documents from audio and video shopping sites. The results show an average extraction precision of 91.3%.
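
    The seed-then-learn workflow described above can be illustrated with a minimal sketch. It is not the authors' mDTD syntax or the SmL algorithm, neither of which is specified in this abstract. It assumes a hypothetical seed rule (the text of a `<b>` element is a product title), uses a character-level ratio from Python's `difflib` as a stand-in for the lexical similarity feature, and omits the POS-tag feature entirely. The sample documents, the candidate strings, and the 0.6 threshold are invented for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical seed rule (an assumption, not mDTD syntax):
# a product title is the text inside a <b> element.
SEED_PATTERN = re.compile(r"<b>(.*?)</b>")


def extract_seed_instances(documents):
    """Apply the seed rule to structured documents and collect instances."""
    instances = []
    for doc in documents:
        instances.extend(SEED_PATTERN.findall(doc))
    return instances


def lexical_similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for the paper's feature."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_similar(candidates, instances, threshold=0.6):
    """Keep candidates that resemble at least one seed-extracted instance."""
    return [
        c for c in candidates
        if any(lexical_similarity(c, i) >= threshold for i in instances)
    ]


# Invented sample data: two structured listings and two unlabeled fragments.
structured = [
    "<li><b>Sony DVD Player</b> $99</li>",
    "<li><b>Panasonic VCR</b> $79</li>",
]
seeds = extract_seed_instances(structured)
candidates = ["Sony DVD Recorder", "Free shipping on all orders"]
selected = select_similar(candidates, seeds)
print(seeds)
print(selected)
```

    In this sketch no hand-tagged corpus is involved: the only supervision is the seed rule, and the instances it extracts serve as the reference set against which new fragments are compared.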

    This research was supported by the BK21 program of the Ministry of Education, Korea.