World Scientific

Configurable Customized Information Extraction and Processing Pipeline

    https://doi.org/10.1142/S0218001424590122

    Extracting information from scanned business documents is a necessary commercial task, yet it is still done mostly by hand and requires significant human effort. Current solutions for automated document information extraction remain limited in the customizability that users require and in their ability to extract dataset-specific information, so the area is still an active field of research. In this paper, we propose modifications and improvements to our previously developed custom pipeline for extracting and tabulating key-value pairs from commercial invoice documents. Our design changes and additions adapt the pipeline to a wider variety of document types and use cases, primarily through dataset-specific configuration files that promote customizability, along with new technical modules that address both general and dataset-specific complexities. We compare our pipeline’s performance against current machine learning and commercial solutions on a real-world dataset, and demonstrate that it extracts a wider variety of fields while maintaining accuracy that is competitive with or greater than that of the alternative solutions.
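
To make the idea of a dataset-specific configuration file concrete, the following is a minimal sketch and not the pipeline described in the paper. It assumes OCR has already produced plain text, and that each field to extract can be described by one regular expression plus a normalizer. The names `INVOICE_CONFIG`, `NORMALIZERS`, and `extract_fields`, the field names, and the sample invoice text are all hypothetical and chosen only for illustration.

```python
import re
from typing import Any, Callable, Dict, Optional

# Hypothetical dataset-specific configuration: each output field maps to a
# regex with exactly one capture group and the name of a normalizer.
# In practice this could live in a JSON or YAML file, one per dataset.
INVOICE_CONFIG: Dict[str, Dict[str, str]] = {
    "invoice_number": {
        "pattern": r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\S+)",
        "type": "text",
    },
    "invoice_date": {
        "pattern": r"\bDate\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})",
        "type": "text",
    },
    "total": {
        # \b keeps "Subtotal" from being mistaken for "Total".
        "pattern": r"\bTotal\s*(?:Due)?\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})",
        "type": "amount",
    },
}

# Normalizers turn the captured string into the value stored in the table.
NORMALIZERS: Dict[str, Callable[[str], Any]] = {
    "text": lambda s: s.strip(),
    "amount": lambda s: float(s.replace(",", "")),
}


def extract_fields(ocr_text: str, config: Dict[str, Dict[str, str]]) -> Dict[str, Optional[Any]]:
    """Return one table row (field -> value); fields not found are None."""
    row: Dict[str, Optional[Any]] = {}
    for field, spec in config.items():
        match = re.search(spec["pattern"], ocr_text, flags=re.IGNORECASE)
        if match is None:
            row[field] = None
        else:
            row[field] = NORMALIZERS[spec["type"]](match.group(1))
    return row


# Illustrative OCR output, not taken from the paper's dataset.
sample = (
    "ACME Corp\n"
    "Invoice No: INV-0042\n"
    "Date: 2024-05-28\n"
    "Subtotal: $1,000.00\n"
    "Total Due: $1,234.50\n"
)

if __name__ == "__main__":
    print(extract_fields(sample, INVOICE_CONFIG))
```

Under these assumptions, supporting a new document type means writing a new configuration dictionary rather than changing `extract_fields`, which is the kind of customizability the abstract attributes to configuration files. A real pipeline would also need layout analysis and OCR error handling, which this sketch omits.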