Learning mDTD Extraction Patterns for Semi-structured Web Information Extraction
Abstract
This paper presents a new extraction pattern, called a modified Document Type Definition (mDTD), which relies on an analytical interpretation method to identify textual fragments of Web documents. The mDTD differs from the conventional DTD in two major ways. Syntactically, it introduces an extended content model with additional operators and keywords. Semantically, it changes how the rules are interpreted. An mDTD can represent both HTML structures and extraction targets. The design goal of mDTD is to overcome major barriers in information extraction, such as achieving domain portability with minimal human intervention while maintaining high extraction performance. The user composes mDTD rules as seed rules, which our system uses to extract instances from structured documents on the Web. These extracted instances serve as input to SmL (Sequential mDTD Learner), which generates new mDTD rules based on part-of-speech (POS) tags and lexical similarity features. Learning does not require a hand-tagged corpus. We experimented with 200 Web documents from audio and video shopping sites; the results show an average extraction precision of 91.3%.
This research was supported by the BK21 program of the Ministry of Education, Korea.