Article Open access Peer-reviewed

Learning-free, divide and conquer text-line extraction algorithm for printed Arabic text with diacritics

2022; Elsevier BV; Volume: 34; Issue: 9 Language: English

10.1016/j.jksuci.2022.04.021

ISSN

2213-1248

Authors

Aziz Qaroush, Abdalkarim Awad, Abualsoud Hanani, Khader Mohammad, Basam Jaber, Ala Hasheesh

Topic(s)

Vehicle License Plate Recognition

Abstract

The extraction of text lines from document images is a critical step in optical character recognition and is still considered an open problem in document analysis. Numerous font variations, diacritics, and overlapping or touching text lines challenge algorithms designed for machine-printed text. In this paper, we present a simple and robust text-line extraction algorithm for printed Arabic text. The method has two stages: preprocessing and text-line extraction. It extracts text lines efficiently, even at small font sizes, by using baselines, projection profiles, and a top-down divide-and-conquer technique. The method is fast and handles non-uniform inter-line spacing and overlapping text lines. A set of experiments was conducted on a collected dataset. The experiments showed that the proposed method outperforms two baseline approaches, with an average error rate of 3% on Arabic text without diacritics and 11% on Arabic text with diacritics. The experiments also show that the algorithm has a low computational cost, with an average running time of 0.087 s per document image.
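To make the combination of projection profiles and top-down divide and conquer more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a binarized page stored as a NumPy array with ink pixels equal to 1, and it introduces a hypothetical `max_height` parameter (the tallest band accepted as a single line). Bands of consecutive inked rows are found from the horizontal projection profile, and any band taller than `max_height` is recursively cut at its weakest interior row. The baseline detection and the preprocessing stage mentioned in the abstract are not modeled here.

```python
import numpy as np


def horizontal_profile(binary):
    """Count ink pixels (value 1) in every row of a binary image."""
    return binary.sum(axis=1)


def split_rows(profile, top, bottom, max_height, out):
    """Recursively split the row band [top, bottom) until each piece
    is at most max_height rows tall (max_height must be >= 2).

    The cut is placed at the row with the least ink inside the middle
    part of the band, so a cut never lands on the band's own edges.
    """
    height = bottom - top
    if height <= max_height:
        out.append((top, bottom))
        return
    margin = max(1, height // 4)  # rows skipped at each end of the band
    interior = profile[top + margin:bottom - margin]
    cut = top + margin + int(np.argmin(interior))
    split_rows(profile, top, cut, max_height, out)
    split_rows(profile, cut, bottom, max_height, out)


def extract_lines(binary, max_height):
    """Return (top, bottom) row ranges, one per estimated text line."""
    profile = horizontal_profile(binary)
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y  # a band of inked rows begins
        elif ink == 0 and start is not None:
            split_rows(profile, start, y, max_height, lines)
            start = None
    if start is not None:  # band touching the bottom of the page
        split_rows(profile, start, len(profile), max_height, lines)
    return lines


# Synthetic page: one isolated line and two lines joined by a thin link,
# standing in for a diacritic that bridges the inter-line gap.
page = np.zeros((20, 10), dtype=np.uint8)
page[2:6, :] = 1     # isolated line
page[8:12, :] = 1    # upper line of the touching pair
page[12, 4] = 1      # single-pixel link between the pair
page[13:17, :] = 1   # lower line of the touching pair

lines = extract_lines(page, max_height=5)
print(lines)
```

In this toy page the isolated line is returned as is, while the nine-row merged band exceeds `max_height` and is cut at the thin link row. A real page would need a `max_height` estimated from the data, for example from the typical band height, which is again an assumption of this sketch rather than something stated in the abstract.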
