Text, photo, and line extraction in scanned documents

M. Sezer Erkilinc; Mustafa I. Jaber; Eli Saber; Peter Bauer; Dejan Depalov

doi:10.1117/1.JEI.21.3.033006

Journal of Electronic Imaging, Vol. 21, Issue 3, 033006 (July 2012). https://doi.org/10.1117/1.JEI.21.3.033006

Abstract

We propose a page layout analysis algorithm that classifies a scanned document into regions such as text, photo, or strong lines. The proposed scheme consists of five modules. The first module performs several preprocessing steps, such as image scaling, filtering, color space conversion, and gamma correction, to enhance the scanned image quality and reduce the computation time in later stages. The second module detects text, using a wavelet transform to generate candidate text regions and run-length encoding to validate them. The third module uses a Markov random field-based block-wise segmentation that employs a basis vector projection technique with maximum a posteriori probability optimization to detect photo regions. The fourth module applies edge detection, edge linking, line-segment fitting, and the Hough transform to detect strong edges and lines. In the last module, the resulting text, photo, and edge maps are combined into a page layout map using K-means clustering. The proposed algorithm has been tested on several hundred documents with simple and complex page layout structures and contents, such as articles, magazines, business cards, dictionaries, and newsletters, and compared against state-of-the-art page-segmentation techniques with benchmark performance. The results indicate that our methodology achieves an average classification accuracy of approximately 89% for text, photo, and background regions.
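As an illustration of the run-length encoding mentioned for the text module, the following is a minimal sketch, not the authors' implementation. The abstract does not say how run lengths are used to validate text regions, so the heuristic in `looks_like_text` and its thresholds (`max_mean_run`, `min_transitions`) are hypothetical assumptions chosen only for this example: a binarized row crossing text is assumed to show many short foreground runs, while a row crossing a photo is assumed to show few long ones.

```python
from itertools import groupby


def run_lengths(row):
    """Run-length encode a binary row as a list of (value, length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(row)]


def looks_like_text(row, max_mean_run=4.0, min_transitions=4):
    """Hypothetical validation rule (an assumption, not from the paper):
    accept a row as text-like if it alternates often between background (0)
    and foreground (1) and its foreground runs are short on average."""
    runs = run_lengths(row)
    foreground = [length for value, length in runs if value == 1]
    if not foreground:
        return False  # no foreground pixels, so nothing to validate
    transitions = len(runs) - 1
    mean_run = sum(foreground) / len(foreground)
    return transitions >= min_transitions and mean_run <= max_mean_run


# Toy binarized rows: short alternating strokes versus one long solid run.
text_row = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0]
photo_row = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print(run_lengths(photo_row))      # [(0, 1), (1, 11), (0, 1)]
print(looks_like_text(text_row))   # True
print(looks_like_text(photo_row))  # False
```

In a real pipeline the thresholds would depend on the scan resolution and font sizes, and the rule would be applied to the candidate regions produced by the wavelet stage rather than to isolated rows.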

Citation

M. Sezer Erkilinc, Mustafa I. Jaber, Eli Saber, Peter Bauer, and Dejan Depalov "Text, photo, and line extraction in scanned documents," Journal of Electronic Imaging 21(3), 033006 (13 July 2012). https://doi.org/10.1117/1.JEI.21.3.033006

Published: 13 July 2012
