Book chapter

Generalization of Hindi OCR Using Adaptive Segmentation and Font Files

2009; Language: English

10.1007/978-1-84800-330-9_10

ISSN

1617-7916

Authors

Mudit Agrawal, Huanfeng Ma, David Doermann

Topic(s)

Image Processing and 3D Reconstruction

Abstract

In this chapter, we describe an adaptive Indic OCR system implemented as part of a rapidly retargetable language tool effort and extend work found in [20, 2]. The system includes script identification, character segmentation, training sample creation, and character recognition. For script identification, Hindi words are identified in bilingual or multilingual document images using features of the Devanagari script and a support vector machine (SVM). Identified words are then segmented into individual characters using a font-model-based intelligent character segmentation and recognition system. Using characteristics of structurally similar TrueType fonts, our system automatically builds a model to be used for the segmentation and recognition of the new script, independent of glyph composition. The key is a reliance on known font attributes. In our recognition system, three feature extraction methods are used to demonstrate the importance of appropriate features for classification. The methods are tested on both Latin and non-Latin scripts. Results show that character-level recognition accuracy exceeds 92% for non-Latin and 96% for Latin text on degraded documents. This work is a step toward the recognition of scripts of low-density languages, which typically do not warrant the development of commercial OCR yet often have complete TrueType font descriptions.
