Exchanging image processing and OCR components in a Setswana digitisation pipeline

Article Open access Peer-reviewed

2020; South African Institute of Computer Scientists and Information Technologists; Volume: 32; Issue: 2; Language: English

10.18489/sacj.v32i2.707

ISSN

2313-7835

Authors

Gideon Kotzé, Friedel Wolff

Topic(s)

Digital Imaging for Blood Diseases

Abstract

As more natural language processing (NLP) applications benefit from neural-network-based approaches, it makes sense to re-evaluate existing work in NLP. A complete digitisation pipeline consists of several components that handle the material in sequence. Image processing after scanning has been shown to be an important factor in final quality. Here we compare two approaches to visually enhancing documents before Optical Character Recognition (OCR): (1) a combination of ImageMagick and Unpaper, and (2) OCRopus. For the OCR component, we also compare Calamari, a new line-based OCR package using neural networks, with the well-known Tesseract 3. Our evaluation on a set of Setswana documents reveals that the combination of ImageMagick/Unpaper and Calamari improves on a current baseline based on Tesseract 3 and ImageMagick/Unpaper by over 30%, achieving a mean character error rate of 1.69 across all combined test data.
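
The abstract reports results as a mean character error rate (CER) and as a relative improvement over a baseline. The sketch below illustrates how such figures are commonly computed. It is not taken from the paper: it assumes CER is the Levenshtein edit distance divided by the reference length and expressed as a percentage, and that the mean is taken per document. The function names and the sample strings are hypothetical.

```python
from statistics import mean


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent (assumed definition, see lead-in)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


def mean_cer(pairs: list[tuple[str, str]]) -> float:
    """Unweighted mean of per-document CER values (an assumption; pooling
    all characters before dividing would give a different figure)."""
    return mean(cer(ref, hyp) for ref, hyp in pairs)


def relative_improvement(baseline: float, new: float) -> float:
    """Percentage reduction in error rate relative to the baseline."""
    return 100.0 * (baseline - new) / baseline


# Hypothetical ground truth and OCR output, not data from the paper.
sample = [("dumela", "dumeia"), ("ke a leboga", "ke a leboga")]
sample_mean = mean_cer(sample)
```

Under this definition, a system whose mean CER is 30% lower than the baseline's would give `relative_improvement(baseline, new) == 30.0`.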
