Document Image Analysis Using Imagemagick and Tesseract-ocr
2016; Volume: 3; Issue: 5; Language: English
DOI: 10.17148/iarjset.2016.3523
ISSN: 2394-1588
Authors: Parra-Calderón Carlos L., Antony P. J, D N Sachin
Topic(s): Handwritten Text Recognition Techniques
Abstract: Document image analysis is the field of converting paper documents into an editable electronic representation by performing optical character recognition (OCR). In recent years, there has been a tremendous amount of progress in the development of open source OCR systems. The tesseract-ocr engine, which was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, in particular the line finding, the features/classification methods, and the adaptive classifier. OCRopus is one of the leading open source document analysis systems; it uses tesseract-ocr and has a modular, pluggable architecture. Imagemagick is an open source image processing tool. This paper presents an overview of the different steps involved in a document image analysis system and illustrates them with examples from a combination of imagemagick and OCRopus.
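The abstract does not list the individual processing steps, so the following is only a minimal sketch of the general idea: clean a scanned page with ImageMagick, then pass the result to an OCR engine. It makes several assumptions not found in the record. It calls the tesseract command-line tool directly rather than going through OCRopus, the preprocessing steps (grayscale, deskew, threshold) and their values are illustrative choices, and the function names and file names (`preprocess_command`, `ocr_command`, `run_pipeline`, `cleaned.png`, `result`) are hypothetical. On ImageMagick 7 the executable is usually `magick` rather than `convert`.

```python
import shutil
import subprocess
from pathlib import Path


def preprocess_command(src, dst, convert_exe="convert"):
    """Build an ImageMagick command that cleans up a scanned page.

    The chosen steps and values are assumptions for illustration,
    not the settings used in the paper.
    """
    return [
        convert_exe, str(src),
        "-colorspace", "Gray",   # drop colour information
        "-deskew", "40%",        # straighten a slightly rotated scan
        "-threshold", "50%",     # binarize to black and white
        str(dst),
    ]


def ocr_command(image, output_base, lang="eng", tesseract_exe="tesseract"):
    """Build a tesseract command; the text is written to <output_base>.txt."""
    return [tesseract_exe, str(image), str(output_base), "-l", lang]


def run_pipeline(src, workdir="."):
    """Run preprocessing, then OCR, and return the path of the text file."""
    workdir = Path(workdir)
    cleaned = workdir / "cleaned.png"   # hypothetical intermediate file
    out_base = workdir / "result"       # hypothetical output base name
    commands = [
        preprocess_command(src, cleaned),
        ocr_command(cleaned, out_base),
    ]
    for cmd in commands:
        # Fail early with a clear message if a tool is not installed.
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
    return out_base.with_suffix(".txt")
```

With both tools installed, `run_pipeline("scan.png")` would write `cleaned.png` and `result.txt` in the current directory. Keeping command construction separate from execution makes it easy to inspect or adjust the flags before running anything.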