7 February 2011 DOM-based print-link detection for web article extraction
Proceedings Volume 7879, Imaging and Printing in a Web 2.0 World II; 787904 (2011) https://doi.org/10.1117/12.872573
Event: IS&T/SPIE Electronic Imaging, 2011, San Francisco Airport, California, United States
Abstract
Web article pages usually have hyperlinks (or links) that lead to print-friendly web pages containing mainly the article content. Content extraction using these print-friendly pages is generally easier and more reliable, but the many variations in how print-links are represented in HTML make robust print-link detection more difficult than it first appears. First, the link can be text-based, image-based, or both. For example, there is a lexicon of phrases used to indicate print-friendly pages, such as "print", "print article", and "print-friendly version". In addition, some links use printer-resembling image icons, with or without a print phrase present. To complicate matters further, not all links contain a valid URL; in some cases the page is generated dynamically, either by client-side JavaScript or by the server, so no URL is available for extraction. We estimate that more than 90% of Web article pages have print-links, about 35% of which have valid print-friendly URLs. Our solution to the print-link extraction problem has two stages: (1) detection of the print-link, and (2) retrieval of the print-friendly page URL from the link attributes, including a test of its validity. Experimental results based on roughly 2000 web article pages suggest that our solution can achieve over 99% precision and 97% recall.
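The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the full phrase lexicon, the icon-recognition method, or the validity test, so everything below is an assumption. The names `PRINT_PHRASES`, `PrintLinkFinder`, `print_url`, and `find_print_urls` are hypothetical, the lexicon holds only the three phrases quoted in the abstract, a printer icon is guessed from the word "print" in an image's `alt` or `src`, and a URL is treated as valid when it resolves to an `http` or `https` address rather than a `javascript:` handler or a bare fragment.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed lexicon: only the three example phrases quoted in the abstract.
PRINT_PHRASES = {"print", "print article", "print-friendly version"}


class PrintLinkFinder(HTMLParser):
    """Stage 1: collect <a> elements whose text or image suggests printing."""

    def __init__(self):
        super().__init__()
        self.candidates = []      # href values (possibly None) of detected print-links
        self._in_anchor = False
        self._href = None
        self._text = []
        self._icon = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._in_anchor = True
            self._href = attrs.get("href")
            self._text = []
            self._icon = False
        elif tag == "img" and self._in_anchor:
            # Assumed heuristic: a printer icon mentions "print" in alt or src.
            hint = (attrs.get("alt") or "") + " " + (attrs.get("src") or "")
            if "print" in hint.lower():
                self._icon = True

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            # Normalize whitespace and case before the lexicon lookup.
            phrase = " ".join("".join(self._text).split()).lower()
            if phrase in PRINT_PHRASES or self._icon:
                self.candidates.append(self._href)
            self._in_anchor = False


def print_url(href, base_url):
    """Stage 2: return an absolute print-page URL, or None if href is unusable."""
    href = (href or "").strip()
    # Links driven by client-side script or bare fragments carry no URL.
    if not href or href.startswith("#") or href.lower().startswith("javascript:"):
        return None
    url = urljoin(base_url, href)
    if urlparse(url).scheme not in ("http", "https"):
        return None
    return url


def find_print_urls(html, base_url):
    """Run both stages and return the usable print-friendly URLs."""
    finder = PrintLinkFinder()
    finder.feed(html)
    urls = (print_url(href, base_url) for href in finder.candidates)
    return [url for url in urls if url]
```

Calling `find_print_urls(page_html, page_url)` returns the absolute URLs of detected print-links whose `href` is usable. Links that are detected in stage 1 but rejected in stage 2 correspond to the dynamically generated case the abstract mentions, where no URL is available for extraction.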
© (2011) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Sam Liu, Suk-Hwan Lim, and Jerry Liu "DOM-based print-link detection for web article extraction", Proc. SPIE 7879, Imaging and Printing in a Web 2.0 World II, 787904 (7 February 2011); https://doi.org/10.1117/12.872573
KEYWORDS
Associative arrays
Printing
Information visualization
Java
Advanced distributed simulations
Fluctuations and noise
Infrared imaging