This work provides an open-source method for extracting relevant information from scanned documents such as bills, bank account statements, and invoices. The solution supports documents in 10 different languages and can extract data from them irrespective of their template or structure. Existing solutions based on OpenCV and deep learning technologies do not provide a generic approach that combines high accuracy with support for multiple languages. The proposed method identifies the language of the input document using a pre-trained fastText model. The document is then segmented into text regions using the Run Length Smoothing Algorithm (RLSA). The output of RLSA is passed through a custom pattern recognition algorithm that keeps only the regions likely to contain relevant data, based on the typical layout of invoices and account statements. The filtered segments are passed through the Tesseract OCR module for raw text extraction. Based on the identified language of the document, the extracted raw text is mapped against language-specific entity libraries, and the final key-value pairs are stored in JSON or CSV files. Tested on more than 1,000 documents, the proposed solution achieved an average accuracy of 90.27% across all supported languages.
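
The segmentation step can be illustrated with a minimal sketch of run-length smoothing. This is not the authors' implementation: the function names `smooth_runs` and `rlsa`, the nested-list image representation (1 = ink, 0 = background), and the threshold values are all assumptions made for illustration. The sketch fills short background gaps between ink pixels along rows and along columns, then combines the two results with a logical AND, which is one common formulation of RLSA.

```python
def smooth_runs(line, threshold):
    """Fill background gaps (0s) of length <= threshold that lie between
    two ink pixels (1s). Gaps touching the edge of the line are left
    unchanged (an assumption; some variants also fill them)."""
    out = list(line)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, pixel in enumerate(line):
        if pixel == 1:
            gap = i - last_ink - 1 if last_ink is not None else 0
            if 0 < gap <= threshold:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def rlsa(image, h_threshold, v_threshold):
    """Smooth a binary image horizontally and vertically, then AND the
    two results so that only regions dense in both directions merge."""
    horizontal = [smooth_runs(row, h_threshold) for row in image]
    # Transpose, smooth each column as if it were a row, transpose back.
    columns = [list(col) for col in zip(*image)]
    smoothed_columns = [smooth_runs(col, v_threshold) for col in columns]
    vertical = [list(row) for row in zip(*smoothed_columns)]
    return [
        [h & v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]


if __name__ == "__main__":
    # Hypothetical scan line: two nearby characters, then a distant one.
    print(smooth_runs([1, 0, 0, 1, 0, 0, 0, 0, 1], threshold=2))
```

In this sketch, a small horizontal threshold merges characters into words, while a larger one merges words into lines or blocks. Connected regions of the smoothed image would then serve as the candidate text regions passed to the filtering and OCR stages.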