pdfbox-users mailing list archives

From Thomas Fischer <fischer...@aon.at>
Subject Re: PDF Box to parse the data content on Images
Date Tue, 01 May 2012 16:54:23 GMT

PDFBox won't help you with this; it only extracts text from PDF files, not the content of images.
For images you need an OCR (optical character recognition) application. These are available
in both commercial (e.g. ABBYY FineReader) and free (e.g. Tesseract) versions. The EuDML
project is working on a package that does what you want; see PdfToTextViaOCR.
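
If you want to wire this up yourself, here is a minimal sketch in plain Java of the "fall back to OCR" idea. It is only an illustration under assumptions, not how PDFBox or the EuDML package works: the class OcrFallback, its method names, and the file names are hypothetical, and treating a page with no extracted text as "scanned" is a heuristic of mine. It assumes you already have the text PDFBox extracted for a page (e.g. via PDFTextStripper) and that you have saved that page as an image by some other means. The command it builds follows the basic Tesseract command-line form "tesseract <image> <outputbase> -l <lang>".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or Tesseract.
final class OcrFallback {

    // Heuristic (an assumption): if text extraction returned nothing but
    // whitespace, the page is probably a scanned image and needs OCR.
    static boolean needsOcr(String extractedText) {
        return extractedText == null || extractedText.trim().isEmpty();
    }

    // Builds the argument list for the Tesseract command-line tool:
    //   tesseract <imagePath> <outputBase> -l <language>
    // Tesseract writes the recognized text to <outputBase>.txt.
    static List<String> tesseractCommand(String imagePath, String outputBase, String language) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imagePath);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(language);
        return cmd;
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted from one page of a scanned PDF.
        String extracted = "";
        if (needsOcr(extracted)) {
            // "page-1.png" is a placeholder for an image of that page.
            List<String> cmd = tesseractCommand("page-1.png", "page-1", "eng");
            System.out.println(String.join(" ", cmd));
        }
    }
}
```

The returned list can be handed to java.lang.ProcessBuilder to actually run Tesseract, provided it is installed and on the PATH. Note that a page mixing real text with images would not be caught by this simple check.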


On 30.04.2012 at 15:31, chaya jajur wrote:

> Hi Team,
> We are planning to use PDFBox to parse PDF content. I am able to parse and
> read the normal text data in a PDF,
> but I am having trouble reading the data/content on images.
> Our requirement is to read and parse the data/content that appears on
> images as well.
> For example, if I have a scanned copy of a document, I should be able to parse
> its content too.
> Please suggest how to proceed.
> Thanks in advance,
> Chaya
