Regular Articles

Text, photo, and line extraction in scanned documents

Author Affiliations
M. Sezer Erkilinc

University College London, Department of Electronic and Electrical Engineering, Optical Networks Group, Torrington Place, London WC1E 7JE, United Kingdom

Mustafa Jaber

IPPLEX Holdings Corporation, Santa Monica, California 90025

Eli Saber

Rochester Institute of Technology, Department of Electrical and Microelectronic Engineering, Rochester, New York 14623

Peter Bauer

Hewlett-Packard Corporation, Imaging Asset Team, Boise, Idaho 83714

Dejan Depalov

Hewlett-Packard Corporation, Imaging Asset Team, Boise, Idaho 83714

J. Electron. Imaging. 21(3), 033006 (Jul 13, 2012). doi:10.1117/1.JEI.21.3.033006
History: Received November 23, 2011; Revised May 15, 2012; Accepted June 8, 2012

Abstract. We propose a page layout analysis algorithm that classifies a scanned document into regions such as text, photo, or strong lines. The proposed scheme consists of five modules. The first module applies image preprocessing techniques such as image scaling, filtering, color space conversion, and gamma correction to enhance the scanned image quality and reduce the computation time in later stages. The second module performs text detection, in which the wavelet transform and run-length encoding are employed to generate and validate text regions, respectively. The third module uses a Markov random field-based block-wise segmentation that employs a basis vector projection technique with maximum a posteriori probability optimization to detect photo regions. In the fourth module, edge detection, edge linking, line-segment fitting, and the Hough transform are used to detect strong edges and lines. In the last module, the resulting text, photo, and edge maps are combined to generate a page layout map using K-Means clustering. The proposed algorithm has been tested on several hundred documents with simple and complex page layout structures and contents, such as articles, magazines, business cards, dictionaries, and newsletters, and compared against state-of-the-art page-segmentation techniques with benchmark performance. The results indicate that our methodology achieves an average classification accuracy of 89% across text, photo, and background regions.
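
As an illustration of the run-length encoding mentioned for the text module, the following is a minimal sketch in Python. The abstract does not say how run lengths are used to validate text regions, so the rule in `looks_like_text` (short ink runs with frequent ink/background alternation suggest text) and its thresholds `max_run` and `min_transitions` are assumptions made only for this example, not the authors' method.

```python
from itertools import groupby
from typing import List, Tuple


def run_lengths(row: List[int]) -> List[Tuple[int, int]]:
    """Run-length encode a binary pixel row as (value, length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(row)]


def looks_like_text(row: List[int], max_run: int = 4,
                    min_transitions: int = 4) -> bool:
    """Hypothetical validation rule (an assumption, not from the paper).

    A row is accepted as text-like when every ink run (value 1) is short
    and the row alternates between ink and background often enough.
    """
    runs = run_lengths(row)
    ink_runs = [length for value, length in runs if value == 1]
    if not ink_runs:
        return False  # no ink at all, so nothing to validate
    transitions = len(runs) - 1  # number of value changes along the row
    return max(ink_runs) <= max_run and transitions >= min_transitions


if __name__ == "__main__":
    text_row = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
    solid_row = [1] * 12
    print(run_lengths(text_row))
    print(looks_like_text(text_row), looks_like_text(solid_row))
```

In this sketch, a row of thin alternating strokes passes the check, while a solid dark band such as a photo edge or a ruled line does not, because its single ink run exceeds `max_run`.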

© 2012 SPIE and IS&T

Citation

M. Sezer Erkilinc, Mustafa Jaber, Eli Saber, Peter Bauer, and Dejan Depalov,
"Text, photo, and line extraction in scanned documents," J. Electron. Imaging 21(3), 033006 (Jul 13, 2012). http://dx.doi.org/10.1117/1.JEI.21.3.033006

