Paper
Evaluating supervised topic models in the presence of OCR errors (4 February 2013)
Daniel Walker, Eric Ringger, Kevin Seppi
Proceedings Volume 8658, Document Recognition and Retrieval XX; 865812 (2013) https://doi.org/10.1117/12.2008345
Event: IS&T/SPIE Electronic Imaging, 2013, Burlingame, California, United States
Abstract
Supervised topic models are promising tools for text analytics that simultaneously model topical patterns in document collections and relationships between those topics and document metadata, such as timestamps. We empirically examine the effect of OCR noise on the ability of supervised topic models to produce high-quality output through a series of experiments in which we evaluate three supervised topic models and a naive baseline on synthetic OCR data with various levels of degradation and on real OCR data from two different decades. The evaluation includes experiments with and without feature selection. Our results suggest that supervised topic models are no better, or at least not much better, than unsupervised topic models in terms of robustness to OCR errors, and that feature selection has the mixed result of improving topic quality while harming metadata prediction quality. For users of topic modeling methods on OCR data, supervised topic models do not yet solve the problem of finding better topics than the original unsupervised topic models.
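The abstract mentions evaluating models on synthetic OCR data at various levels of degradation. The paper does not specify its corruption procedure; a common way to simulate OCR noise is character-level substitution, deletion, and insertion at a target error rate. The sketch below is a hypothetical illustration of that idea, not the authors' method:

```python
import random

def degrade(text, error_rate, seed=0):
    """Corrupt text with character-level noise to mimic OCR errors.

    Each character is, with total probability `error_rate`, either
    substituted, deleted, or followed by a spurious insertion
    (each with probability error_rate / 3).
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        r = rng.random()
        if r < error_rate / 3:            # substitution
            out.append(rng.choice(alphabet))
        elif r < 2 * error_rate / 3:      # deletion
            continue
        elif r < error_rate:              # spurious insertion
            out.append(ch)
            out.append(rng.choice(alphabet))
        else:                             # character survives intact
            out.append(ch)
    return "".join(out)
```

Running such a generator at several error rates yields a family of progressively noisier corpora against which model robustness can be compared, in the spirit of the experiments described above.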
© (2013) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Daniel Walker, Eric Ringger, and Kevin Seppi "Evaluating supervised topic models in the presence of OCR errors", Proc. SPIE 8658, Document Recognition and Retrieval XX, 865812 (4 February 2013); https://doi.org/10.1117/12.2008345
CITATIONS
Cited by 3 scholarly publications.
KEYWORDS
Data modeling, Feature selection, Optical character recognition, Performance modeling, Machine learning, Statistical modeling, Blood
