Thomas' Blog: December 2010

Hello web and printed document analysis people

Some of you may find this paper interesting. The authors advocate using layout analysis to improve information extraction from web pages. Their preliminary results suggest that rendering a web page and treating it like a printed document provides better clues than the HTML DOM tree does.

Online Medical Journal Article Layout Analysis

There are other people who believe the same thing:

Towards Domain Independent Information Extraction from Web Tables

Extracting Semantic Structure of Web Documents Using Content and Visual Information

This makes sense, as web page authors are primarily concerned with how the browser renders their page and how their audience reads the rendered page, not with how the HTML code looks.

Within a domain, visual layout can be used in the same way as domain knowledge (ontologies), in that both sources of information tend to be more invariant across documents within a domain than the particular HTML tags used.

Reasons given in the paper for not relying on the DOM tree:

1. The DOM nodes may not be in a semantically meaningful order.
2. Simple text lines can be broken into several nodes at different levels of a DOM tree.
3. Visually similar pages can have completely different DOM trees.
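To make these three points concrete, here is a minimal sketch using only Python's standard library. It is my own illustration, not code from the paper: the markup, the box coordinates and the helper names (`text_nodes`, `rendered_line`, `reading_order`) are all hypothetical, and the "layout" is hand-written rather than produced by a real rendering engine.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Record each non-empty text node with its nesting depth in the tag tree.

    Simplification: every start tag is assumed to have a matching end tag,
    so void elements such as <br> would throw the depth count off.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []  # list of (depth, text) in document order

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((self.depth, text))


def text_nodes(html):
    """Return the (depth, text) pairs found in an HTML fragment."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def rendered_line(html):
    """Crude stand-in for rendering: join the text nodes into one line."""
    return " ".join(text for _, text in text_nodes(html))


def reading_order(boxes):
    """Sort (text, x, y) boxes top-to-bottom, then left-to-right.

    Assumes a single-column page; real layouts need proper block analysis.
    """
    return [text for text, x, y in sorted(boxes, key=lambda b: (b[2], b[1]))]


# Hypothetical markup: two ways of writing the same visible line.
page_a = "<p>Volume 12, <b>Issue <i>3</i></b> (2010)</p>"
page_b = "<div><span>Volume 12, Issue 3 (2010)</span></div>"

# Hypothetical rendered boxes, listed in DOM order (e.g. CSS moved the footer).
boxes_in_dom_order = [
    ("Footer", 0, 900),
    ("Title", 0, 0),
    ("Abstract", 0, 100),
]

# Reason 2: one visible line is split into four nodes at depths 1, 2, 3, 1.
print(text_nodes(page_a))
# Reason 3: different trees, same visible text.
print(rendered_line(page_a) == rendered_line(page_b))
# Reason 1: DOM order is Footer, Title, Abstract; visual order is not.
print(reading_order(boxes_in_dom_order))
```

With these inputs, `page_a` yields four text nodes while `page_b` yields one, yet both produce the line "Volume 12, Issue 3 (2010)", and sorting the boxes by position recovers Title, Abstract, Footer even though the DOM lists the footer first.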

Of course, an approach that combines DOM tree information, domain knowledge and visual layout is probably best. Still, these results give me some assurance that I'm not wasting my time focusing my research on printed documents, which have no HTML tags: the same layout-based techniques are also applicable to web extraction.

2010-12-03

Web Page Layout Analysis