This book provides the first comprehensive look at the emerging field of web document analysis. It sets the scene in this new field by combining state-of-the-art reviews of challenges and opportunities with research papers by leading researchers. Readers will find in-depth discussions of the many diverse and interdisciplinary areas within the field, including web image processing, applications of machine learning and graph theory to content extraction and web mining, adaptive web content delivery, multimedia document modeling, and human interactive proofs for web security.
https://doi.org/10.1142/9789812775375_fmatter
https://doi.org/10.1142/9789812775375_0001
In this chapter we enhance the representation of web documents by using graphs instead of vectors. In typical content-based representations built on the popular vector model, structural information (term adjacency and term location) cannot be used for clustering. We have created a framework for extending traditional numerical vector-based clustering algorithms to work with graphs. The approach is demonstrated by an extended version of the classical k-means clustering algorithm that uses the maximum common subgraph distance measure and the concept of median graphs in place of the usual distance and centroid calculations, respectively. An interesting feature of our approach is that determining the maximum common subgraph for measuring graph similarity, in general an NP-complete problem, takes only polynomial time with our graph representation. By applying this graph-based k-means algorithm to the graph model we demonstrate superior performance when clustering a collection of web documents.
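The chapter's exact algorithm is not reproduced here, but a minimal sketch of the idea can be written down for the special case where each document graph has uniquely labeled nodes (terms): the maximum common subgraph then reduces to a set intersection, and the median graph of a cluster can be approximated by the member graph with the smallest total distance to the rest. All names and the document-graph format below are illustrative assumptions.

```python
import random

def graph_size(g):
    """|G| = number of nodes + number of edges."""
    nodes, edges = g
    return len(nodes) + len(edges)

def mcs_size(g1, g2):
    """With uniquely labeled nodes, the maximum common subgraph is just the
    shared nodes plus the edges present in both graphs (polynomial time)."""
    (n1, e1), (n2, e2) = g1, g2
    return len(n1 & n2) + len(e1 & e2)

def mcs_distance(g1, g2):
    """d(G1, G2) = 1 - |mcs(G1, G2)| / max(|G1|, |G2|)."""
    return 1.0 - mcs_size(g1, g2) / max(graph_size(g1), graph_size(g2))

def median_graph(cluster):
    """Set-median approximation: the member minimizing total distance to the rest."""
    return min(cluster, key=lambda g: sum(mcs_distance(g, h) for h in cluster))

def graph_kmeans(graphs, k, iters=20, seed=0):
    """k-means with graph distances and median graphs in place of centroids."""
    random.seed(seed)
    medians = random.sample(graphs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for g in graphs:
            idx = min(range(k), key=lambda i: mcs_distance(g, medians[i]))
            clusters[idx].append(g)
        medians = [median_graph(c) if c else medians[i] for i, c in enumerate(clusters)]
    return clusters

# A document graph is (frozenset of terms, frozenset of directed term-adjacency pairs).
doc_a = (frozenset({"web", "mining", "data"}), frozenset({("web", "mining"), ("data", "mining")}))
doc_b = (frozenset({"web", "mining", "graph"}), frozenset({("web", "mining")}))
doc_c = (frozenset({"soccer", "league", "goal"}), frozenset({("soccer", "league")}))
print(graph_kmeans([doc_a, doc_b, doc_c], k=2))
```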
https://doi.org/10.1142/9789812775375_0002
In this chapter, we describe our steps towards adapting a new approach for graph comparison, known as graph probing, to collections of semi-structured documents (e.g., Web pages coded in HTML). We consider both the comparison of two graphs in their entirety and the determination of whether one graph contains a subgraph that closely matches the other. A formalism is presented that allows us to prove that graph probing yields a lower bound on the true edit distance between graphs. Results from several experimental studies demonstrate the applicability of the approach, showing that graph probing can distinguish the kinds of similarity of interest and that it can be computed efficiently.
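As a loose illustration (not the chapter's formalism), a probe-based comparison can be sketched as follows: each graph is summarized by histograms of cheap local probes, here node-label counts and node-degree counts, and the L1 distance between the probe vectors serves as an efficiently computable proxy; since any single edit operation perturbs only a bounded number of probe counts, this value can grow at most linearly with the true edit distance. The probe choices and graph format are assumptions for illustration.

```python
from collections import Counter

def probe(graph):
    """Summarize a graph by two cheap probe histograms:
    (1) counts of node labels, (2) counts of node degrees."""
    nodes, edges = graph          # nodes: {node_id: label}, edges: set of (u, v)
    label_counts = Counter(nodes.values())
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    degree_counts = Counter(degree[n] for n in nodes)
    return label_counts, degree_counts

def probe_distance(g1, g2):
    """L1 distance between probe histograms; a fast proxy for structural difference."""
    d = 0
    for h1, h2 in zip(probe(g1), probe(g2)):
        for key in set(h1) | set(h2):
            d += abs(h1[key] - h2[key])
    return d

g1 = ({1: "p", 2: "table", 3: "img"}, {(1, 2), (2, 3)})
g2 = ({1: "p", 2: "table", 3: "div"}, {(1, 2)})
print(probe_distance(g1, g2))
```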
https://doi.org/10.1142/9789812775375_0003
Our approach to extracting information from the Web is to analyze the structural content of web pages by exploiting the latent information given by HTML tags. For each specific extraction task, an object model is created consisting of the salient fields to be extracted and the corresponding extraction rules, which are based on a library of HTML parsing functions. We derive extraction rules for both single-slot and multiple-slot extraction tasks, which we illustrate through three sample applications.
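The chapter's rule language and function library are not reproduced here; the sketch below only illustrates the general style, using Python's standard html.parser to tokenize a page and one made-up helper function from which a single-slot extraction rule is composed.

```python
from html.parser import HTMLParser

class Tokenizer(HTMLParser):
    """Flatten an HTML page into ('tag', name, attrs) and ('text', data, None) tokens."""
    def __init__(self):
        super().__init__()
        self.tokens = []
    def handle_starttag(self, tag, attrs):
        self.tokens.append(("tag", tag, dict(attrs)))
    def handle_data(self, data):
        if data.strip():
            self.tokens.append(("text", data.strip(), None))

def text_after_tag(tokens, tag, attr=None, value=None):
    """Illustrative parsing function: return the first text token following a
    start tag that matches the given name (and optionally an attribute value)."""
    for i, (kind, name, attrs) in enumerate(tokens):
        if kind == "tag" and name == tag and (attr is None or attrs.get(attr) == value):
            for kind2, data, _ in tokens[i + 1:]:
                if kind2 == "text":
                    return data
    return None

# A single-slot extraction rule is then a composition of such parsing functions.
def extract_price(html):
    t = Tokenizer()
    t.feed(html)
    return text_after_tag(t.tokens, "span", "class", "price")

print(extract_price('<html><body><span class="price">$19.99</span></body></html>'))
```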
https://doi.org/10.1142/9789812775375_0004
In this chapter we present an approach to the analysis of web documents — and other electronically available document collections — that is based on the combination of XML technology with NLP techniques. A key issue addressed is to offer end-users a collection of highly interoperable and flexible tools for their experiments with document collections. These tools should be easy to use and as robust as possible. XML is chosen as a uniform encoding for all kinds of data: input and output of modules, process information and linguistic resources. This allows effective sharing and reuse of generic solutions for many tasks (e.g., search, presentation, statistics and transformation).
https://doi.org/10.1142/9789812775375_0005
Millions of documents on the Internet exist in page- or image-oriented formats such as PDF. Such documents are currently difficult to read on screen and on handheld devices. This chapter describes a system for the automatic analysis of a document image into atomic fragments (e.g., word images) that can be reconstructed, or "reflowed", onto a display device of arbitrary size, depth, and aspect ratio. This allows scans and other page-image documents to be viewed effectively on a limited-resolution screen or handheld computing device, without the errors and losses introduced by OCR and retypesetting. The methods of image analysis and representation are described.
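Independently of the system described in the chapter, the core reflow step can be sketched as a greedy line-packing of word fragments: each fragment keeps its pixel dimensions, and line breaks are recomputed for the target display width. The fragment representation below is an assumption for illustration.

```python
def reflow(fragments, display_width, line_gap=4):
    """Greedily pack word-image fragments (given in reading order) into lines
    no wider than the target display, returning (fragment, x, y) placements.
    Each fragment is a dict with 'width' and 'height' in pixels."""
    placements, x, y, line_height = [], 0, 0, 0
    for frag in fragments:
        if x + frag["width"] > display_width and x > 0:
            # Start a new line: drop below the tallest fragment on the current line.
            y += line_height + line_gap
            x, line_height = 0, 0
        placements.append((frag, x, y))
        x += frag["width"] + frag.get("space", 6)
        line_height = max(line_height, frag["height"])
    return placements

words = [{"width": 40, "height": 12}, {"width": 90, "height": 12}, {"width": 65, "height": 14}]
for frag, x, y in reflow(words, display_width=120):
    print(frag["width"], "at", (x, y))
```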
https://doi.org/10.1142/9789812775375_0006
In recent times, the way people access information from the web has been transformed. The demand for information to be accessible anywhere, anytime has led to Personal Digital Assistants (PDAs) and cellular phones that can browse the web and find information over wireless connections. However, the small displays of these portable devices greatly diminish the rate at which web pages can be browsed. Efficient algorithms are required to extract the content of web pages and build a faithful reproduction of the original pages on a smaller display, with the important content intact.
https://doi.org/10.1142/9789812775375_0007
In this chapter, we present an approach to automatically analyzing the semantic structure of HTML pages based on detecting visual similarities among the content objects on these pages. The approach builds on the observation that, in most web pages, the layout styles of subtitles or records belonging to the same content category are consistent, and there are apparent separation boundaries between different categories. Thus, such subtitles should have a similar appearance when rendered in visual browsers, and the different categories can be separated clearly. In our approach, we first measure the visual similarities of HTML content objects. We then apply a pattern detection algorithm to find frequent patterns of visual similarity and use a number of heuristics to choose the most plausible patterns. By grouping items according to these patterns, we finally build a hierarchical representation (tree) of the HTML document with semantics inferred from "visual consistency". Preliminary experimental results show promising performance on real web pages.
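A much-simplified sketch of the "visual consistency" idea: render-style attributes of sibling content objects are collapsed into a signature, and repeated signature runs among consecutive siblings are taken as evidence of records of one content category. The signature fields and node representation below are illustrative assumptions, not the chapter's feature set.

```python
from itertools import groupby

def style_signature(node):
    """Collapse the attributes that govern visual appearance into a hashable signature."""
    return (node.get("tag"), node.get("class"), node.get("font_size"), node.get("bold"))

def group_by_visual_consistency(siblings, min_run=2):
    """Group consecutive siblings whose signatures repeat at least `min_run` times;
    such runs are treated as records of one content category, the rest as separators."""
    groups = []
    for sig, run in groupby(siblings, key=style_signature):
        run = list(run)
        if len(run) >= min_run:
            groups.append({"signature": sig, "members": run})
    return groups

siblings = [
    {"tag": "h2", "class": "title", "font_size": 18, "bold": True, "text": "News"},
    {"tag": "li", "class": "item", "font_size": 12, "bold": False, "text": "Story 1"},
    {"tag": "li", "class": "item", "font_size": 12, "bold": False, "text": "Story 2"},
    {"tag": "li", "class": "item", "font_size": 12, "bold": False, "text": "Story 3"},
]
print(group_by_visual_consistency(siblings))
```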
https://doi.org/10.1142/9789812775375_0008
The table is a commonly used presentation scheme for describing relational information. Table understanding on the web has many potential applications, including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although tables in HTML documents are generally marked up as <table> elements, a <table> element does not necessarily indicate the presence of a genuine relational table. Thus the important first step in table understanding in the web domain is to identify genuine tables. In this chapter we explore a machine-learning-based approach to automatic table detection in HTML documents. Various features reflecting the layout as well as the content characteristics of tables are explored. Two different classifiers, a decision tree and Support Vector Machines, are investigated. The system is tested on a large database consisting of 1,393 HTML files collected from hundreds of different web sites across various domains and containing over 10,000 leaf <table> elements. Experiments were conducted using cross validation. The machine-learning-based approach outperformed a previously designed rule-based system and achieved an F-measure of 95.88%.
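The exact feature set and experimental setup belong to the chapter; the sketch below only shows the general shape of such a detector, with a handful of assumed layout/content features per <table> element and scikit-learn's decision tree and SVM classifiers evaluated by cross-validation on a toy corpus.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def table_features(table):
    """Illustrative layout and content features of one leaf <table> element."""
    rows = table["rows"]                      # list of rows, each a list of cell strings
    cells = [c for row in rows for c in row]
    lengths = [len(c) for c in cells] or [0]
    numeric = [c for c in cells if c.replace(".", "", 1).isdigit()]
    return [
        len(rows),                            # number of rows
        max(len(r) for r in rows),            # number of columns
        float(np.std(lengths)),               # variation in cell length
        len(numeric) / max(len(cells), 1),    # fraction of numeric cells
        float(table.get("has_th", False)),    # presence of header cells
    ]

# X: feature vectors for a labeled corpus of <table> elements; y: 1 = genuine table.
tables = [
    {"rows": [["Year", "Sales"], ["2001", "120"], ["2002", "135"]], "has_th": True},
    {"rows": [["Home | About | Contact us | Site map"]], "has_th": False},
    {"rows": [["Name", "Age"], ["Ann", "34"], ["Bob", "41"]], "has_th": True},
    {"rows": [["advertisement banner goes here"]], "has_th": False},
] * 5                                          # tiny toy corpus, repeated for cross-validation
X = np.array([table_features(t) for t in tables])
y = np.array([1, 0, 1, 0] * 5)

for clf in (DecisionTreeClassifier(random_state=0), SVC(kernel="rbf", gamma="scale")):
    print(type(clf).__name__, cross_val_score(clf, X, y, cv=5).mean())
```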
https://doi.org/10.1142/9789812775375_0009
A program that makes an existing website look like a database is called a wrapper, and wrapper learning is the problem of learning website wrappers from examples. We present a wrapper-learning system called WL2 that can exploit several different representations of a document: Document Object Model (DOM)-level and token-level representations, two-dimensional geometric views of the rendered page (for tabular data), and representations of the visual appearance of text as it will be rendered. The learning system described is part of an "industrial-strength" wrapper management system. Controlled experiments show that the learner has broader coverage and a faster learning rate than earlier wrapper-learning systems.
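WL2 itself combines several representations; the fragment below is only a toy token-level wrapper learner in the same general spirit: it induces left/right delimiter strings from labeled example pages and applies them to extract a field from an unseen page. Everything here is an illustrative assumption, not the WL2 rule language.

```python
def longest_common_suffix(strings):
    s = min(strings, key=len)
    while s and not all(t.endswith(s) for t in strings):
        s = s[1:]
    return s

def longest_common_prefix(strings):
    s = min(strings, key=len)
    while s and not all(t.startswith(s) for t in strings):
        s = s[:-1]
    return s

def learn_wrapper(examples):
    """examples: list of (page_html, target_value). Induce the longest left/right
    delimiter strings shared by all examples around the target value."""
    lefts = [page[:page.index(value)] for page, value in examples]
    rights = [page[page.index(value) + len(value):] for page, value in examples]
    return longest_common_suffix(lefts), longest_common_prefix(rights)

def apply_wrapper(wrapper, page):
    left, right = wrapper
    start = page.index(left) + len(left)
    return page[start:page.index(right, start)]

examples = [
    ("<tr><td><b>IBM</b></td></tr>", "IBM"),
    ("<tr><td><b>Acme Corp</b></td></tr>", "Acme Corp"),
]
w = learn_wrapper(examples)
print(apply_wrapper(w, "<tr><td><b>Example Inc</b></td></tr>"))
```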
https://doi.org/10.1142/9789812775375_0010
We propose a method for extracting attributes and their values from Web pages. Our method models Web pages with Hidden Markov Models (HMMs) whose parameters are estimated without manual intervention such as labeling of training samples. The key idea is to estimate some of the HMM parameters by consulting ontologies built from HTML tables, while the other parameters are estimated via the Expectation-Maximization (EM) algorithm. In our experiments, we show that the algorithm can extract attributes and values from various kinds of attribute-value expressions.
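As a rough illustration of the parameter-seeding idea (not the chapter's model), one can fix the emission probabilities of an "attribute" state from term statistics harvested from HTML-table headers, leave the remaining parameters at uninformative values, and hand them to an EM (Baum-Welch) trainer. The state set, names, and ontology format below are assumptions.

```python
import numpy as np

STATES = ["ATTRIBUTE", "VALUE", "OTHER"]

def seed_emissions(vocab, ontology_attribute_counts, smoothing=1.0):
    """Build a (n_states x n_words) emission matrix. The ATTRIBUTE row is estimated
    from ontology counts (words seen as column headers of HTML tables); the other
    rows start uniform and are left for EM to re-estimate from unlabeled pages."""
    n = len(vocab)
    B = np.full((len(STATES), n), 1.0 / n)
    attr = np.array([ontology_attribute_counts.get(w, 0.0) + smoothing for w in vocab])
    B[0] = attr / attr.sum()
    return B

vocab = ["price", "color", "red", "$12", "shipping", "the"]
ontology_counts = {"price": 40, "color": 25, "shipping": 18}   # from harvested table headers
A = np.full((3, 3), 1.0 / 3)        # transitions: uninformative start, re-estimated by EM
pi = np.full(3, 1.0 / 3)            # initial state distribution, re-estimated by EM
B = seed_emissions(vocab, ontology_counts)
print(np.round(B, 3))
# A, pi, and the non-ATTRIBUTE rows of B would then be updated by Baum-Welch (EM),
# keeping or lightly regularizing the ontology-seeded ATTRIBUTE row.
```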
https://doi.org/10.1142/9789812775375_0011
This chapter describes a new approach to the segmentation of text in images on Web pages. In the same spirit as the authors' previous work on this subject, the approach attempts to model the ability of humans to differentiate between colours. Pixels of similar colour are first grouped using a colour distance defined in a perceptually uniform colour space (as opposed to the commonly used RGB). The resulting colour connected components are then grouped into larger (character-like) regions with the aid of a propinquity measure, the output of a fuzzy inference system. This measure expresses the likelihood of merging two components based on two features: the colour distance between the components in the L*a*b* colour space, and the topological relationship between the two components. The results indicate better performance than the authors' previous methods, and possibly better performance than other existing methods, although a direct comparison is not strictly possible owing to differences in application-domain characteristics.
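The fuzzy inference system itself is not reproduced here; the sketch below only illustrates the two inputs it works with, computing the colour distance between two connected components in the L*a*b* space (via scikit-image) together with a crude spatial-closeness term, and combining them into a single merge likelihood. The combination rule is a placeholder, not the chapter's fuzzy system.

```python
import numpy as np
from skimage.color import rgb2lab

def mean_lab(component_rgb_pixels):
    """Convert a component's pixels (N x 3, values 0-255) to L*a*b* and average them."""
    rgb = np.asarray(component_rgb_pixels, dtype=np.float64).reshape(-1, 1, 3) / 255.0
    return rgb2lab(rgb).reshape(-1, 3).mean(axis=0)

def colour_distance(comp1_pixels, comp2_pixels):
    """Euclidean distance (Delta E) between mean component colours in L*a*b*."""
    return float(np.linalg.norm(mean_lab(comp1_pixels) - mean_lab(comp2_pixels)))

def propinquity(comp1_pixels, comp2_pixels, gap_px, colour_scale=20.0, gap_scale=10.0):
    """Placeholder for the fuzzy inference system: likelihood of merging two components,
    high when their colours are close AND they are topologically close (small gap)."""
    colour_closeness = np.exp(-colour_distance(comp1_pixels, comp2_pixels) / colour_scale)
    spatial_closeness = np.exp(-gap_px / gap_scale)
    return float(colour_closeness * spatial_closeness)

dark_red = [[150, 20, 25]] * 30
brick_red = [[160, 30, 30]] * 25
blue = [[20, 40, 200]] * 40
print(propinquity(dark_red, brick_red, gap_px=3))   # similar colours, adjacent: high
print(propinquity(dark_red, blue, gap_px=3))        # different colours: low
```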
https://doi.org/10.1142/9789812775375_0012
Most approaches to searching for images on the Web or in a closed database emphasize image processing techniques. However, images in Web documents are always accompanied by text, both content and markup. This research examines whether this text can be used to accurately identify images on the Web that match textual queries. A prototype search tool was constructed that uses an existing Web search engine to find candidate pages and then analyzes these pages for the presence of several textual clues to image content. We found three simple clues (image file name, page title, and text value of an image's ALT attribute) that showed promise for finding relevant images. This suggests that it will be both more efficient and more effective to use text as the first step in finding images on the Web.
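A minimal sketch of scoring candidate images by the three clues (file name, page title, ALT text), using only the standard html.parser; the equal weighting of clues is an arbitrary illustration, not the prototype's scoring scheme.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class ImageClueCollector(HTMLParser):
    """Collect the page title and the (src, alt) pairs of all images."""
    def __init__(self):
        super().__init__()
        self.title, self.images, self._in_title = "", [], False
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.images.append((attrs.get("src", ""), attrs.get("alt", "")))
        elif tag == "title":
            self._in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
    def handle_data(self, data):
        if self._in_title:
            self.title += data

def score_images(html, query):
    """Score each image by how many query terms appear in the image file name,
    the page title, and the ALT text (equal weights, purely illustrative)."""
    parser = ImageClueCollector()
    parser.feed(html)
    terms = query.lower().split()
    scored = []
    for src, alt in parser.images:
        name = urlsplit(src).path.rsplit("/", 1)[-1].lower()
        clues = [name, parser.title.lower(), alt.lower()]
        score = sum(t in clue for t in terms for clue in clues)
        scored.append((score, src))
    return sorted(scored, reverse=True)

page = '<html><head><title>Red pandas</title></head><body>' \
       '<img src="/img/red-panda.jpg" alt="a red panda in a tree">' \
       '<img src="/img/banner.gif" alt=""></body></html>'
print(score_images(page, "red panda"))
```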
https://doi.org/10.1142/9789812775375_0013
The content of the World Wide Web has moved rapidly from text-only to multimedia-rich. As information becomes available more and more as multimedia content, and requires more personalized access, we deem the existing infrastructures inadequate. To enable effective personalized search in Web-based or large-scale image libraries, this chapter proposes a perception-based search paradigm. We present the anatomy of a large-scale image search engine, which includes such a perception-based search component, a multi-resolution image-feature extractor, and a high-dimensional indexer, as well as traditional components such as a crawler and a keyword extractor. Through examples and an empirical study, we show that our system is superior to traditional Content-Based Image Retrieval (CBIR) systems in three aspects: personalization, search accuracy, and efficiency.
https://doi.org/10.1142/9789812775375_0014
Web services offered for human use are being abused by programs. Efforts to defend against such abuse have, over the last five years, stimulated the development of a new family of security protocols able to distinguish between human and machine users automatically over GUIs and networks. AltaVista pioneered this technology in 1997; by 2000, Yahoo! and PayPal were using similar methods. Researchers at Carnegie Mellon University and, later, a collaboration between the University of California at Berkeley and the Palo Alto Research Center developed such tests. By January 2002 the subject was called 'human interactive proofs' (HIPs), defined broadly as challenge/response protocols which allow a human to authenticate herself as a member of a given group: e.g., human (vs. machine), herself (vs. anyone else), etc. All commercial uses of HIPs exploit the gap in reading ability between humans and machines. Thus, many technical issues studied by the document image analysis (DIA) research community are relevant to HIPs. This chapter describes the evolution of HIP R&D, applications of HIPs now and on the horizon, relevant legal issues, highlights of the first NSF HIP workshop, and proposals for a DIA research agenda to advance the state of the art of HIPs.
https://doi.org/10.1142/9789812775375_0015
Many large collections of document images are now becoming available online as part of digital library initiatives, fueled by the explosive growth of the World Wide Web. In this chapter, we examine protocols and system-related issues that arise in attempting to make use of these new resources, both as a target application (building better search engines) and as a way of overcoming the problem of acquiring ground-truth to support experimental document analysis research. We also report on our experiences running two simple tests involving data drawn from one such collection. The potential synergies between document analysis and digital libraries could lead to substantial benefits for both communities.
https://doi.org/10.1142/9789812775375_0016
This chapter proposes a new way of authoring multimedia documents. It uses the concept of structured media, which allows deeper access into media objects. The proposed models can be considered both a utilization of MPEG-7 for intra-media content description and an extension of the hierarchical structure, interval and region based model for inter-media composition. We describe an experiment in structuring video within VideoMadeus, our authoring and presentation environment for multimedia documents, which takes advantage of this model. It allows the author to interactively specify video content structure descriptions that can then be used for the composition of video elements (characters, shots, scenes, etc.) with other media objects (text, sound, images, video, etc.). Such composition makes it easy to realize attractive multimedia presentations in which media content can be synchronized in rich and flexible ways, such as tracking an object in a video, attaching hyperlinks to video objects, and fine-grained synchronization (for example, a piece of text synchronized with a video fragment).
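The VideoMadeus model itself is richer than this; the sketch below only illustrates the kind of structure involved, with hypothetical classes for temporal intervals, intra-media video elements (scenes, shots, tracked objects), and synchronization links that compose them with other media objects.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Interval:
    start: float                      # seconds from the beginning of the video
    end: float

@dataclass
class VideoElement:
    """A node of the intra-media structure: scene, shot, or tracked object/character."""
    kind: str                         # "scene" | "shot" | "object"
    interval: Interval
    region: Optional[tuple] = None    # (x, y, w, h) for spatial objects, if any
    children: List["VideoElement"] = field(default_factory=list)

@dataclass
class SyncLink:
    """Inter-media composition: synchronize another media object with a video element."""
    video_element: VideoElement
    media: str                        # e.g. a text caption, sound clip, or hyperlink target
    relation: str = "equals"          # temporal relation (equals, starts, during, ...)

shot = VideoElement("shot", Interval(12.0, 19.5))
hero = VideoElement("object", Interval(13.0, 18.0), region=(40, 60, 120, 200))
scene = VideoElement("scene", Interval(12.0, 47.0), children=[shot])
caption = SyncLink(hero, media="text: 'the protagonist enters'", relation="during")
print(scene, caption, sep="\n")
```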
https://doi.org/10.1142/9789812775375_0017
Strongly promoted by the World Wide Web, documents play a growing role within global information systems. The use of HTML, primarily intended as the standard representation for hypertext information over the Internet, has been significantly diverted from its initial goal. HTML is often used to specify the global structure of a Web site whose effective content mainly resides within documents such as PostScript or PDF files. Moreover, despite the current evolution of the HTML standard, HTML documents themselves remain mostly presentation oriented. Finally, the XML initiative reinforces the production of, once again, presentation-oriented documents, generated on the fly from databases. Document analysis, which aims at extracting symbolic and structured information from the physical representation of documents, is thus provided with attractive new ground for investigation. The objective of this chapter is twofold: on the one hand, it emphasizes the evolution of document models, which drastically affects the goal of the recognition process; on the other hand, it provides hints on techniques and methods for tackling new Web-based document analysis applications.
https://doi.org/10.1142/9789812775375_bmatter