pdfbox-users mailing list archives

From Jeremias Maerki <...@jeremias-maerki.ch>
Subject Re: extracting text from image using pdfbox
Date Sun, 14 Oct 2012 08:09:00 GMT
Hi,
Apache PDFBox can't help you here, I'm afraid. What you're after is OCR
functionality (http://en.wikipedia.org/wiki/Optical_character_recognition)
and PDFBox doesn't provide that. The only thing you can do is to extract
the bitmap images using PDFBox and then attempt to decipher the text
contained in them using an external OCR process. Just a warning: don't
expect an OCR process to be 100% accurate.

If you're looking for an open source OCR engine, Tesseract is probably
the most popular one: http://en.wikipedia.org/wiki/Tesseract_%28software%29
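
To make the "external OCR process" step a bit more concrete, here is a minimal
sketch in plain Java that hands an already extracted page image to the
Tesseract command-line tool and reads the recognized text back. It rests on a
few assumptions that are not part of the original question: a `tesseract`
executable is installed and on the PATH, the image has already been written to
disk (for example by PDFBox), and Java 9 or later is available. The class name
TesseractOcr and the file name used in the usage note are purely illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TesseractOcr {

    // Builds the command line: tesseract <image> stdout [-l <language>]
    // Passing "stdout" as the output base makes Tesseract print the text
    // instead of writing a .txt file.
    public static List<String> buildCommand(Path image, String language) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract"); // assumption: the executable is on the PATH
        cmd.add(image.toString());
        cmd.add("stdout");
        if (language != null && !language.isEmpty()) {
            cmd.add("-l");
            cmd.add(language);
        }
        return cmd;
    }

    // Runs Tesseract on one image and returns whatever text it recognized.
    public static String recognize(Path image, String language)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(image, language));
        // Let Tesseract's diagnostics go to our own stderr so they cannot
        // fill an unread pipe and block the child process.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();
        String text = new String(process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TesseractOcr <image-file>");
            return;
        }
        System.out.print(recognize(Paths.get(args[0]), "eng"));
    }
}
```

Running something like `java TesseractOcr page-1.png` should print whatever
Tesseract manages to recognize. The result will usually need manual checking,
since recognition quality depends heavily on the resolution and cleanliness of
the scanned image.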

HTH
Jeremias Maerki


On 12.10.2012 15:47:40 Kishore Babu wrote:
> Hi All,
> Is it possible to extract text from an image (JPEG) using pdfbox or is there any open
> source java code for this?
> 
> When I try to convert pdf to text, it is showing blank output. Then I converted into
> JPEG image. The image contains the text properly, which I am failing to extract.
> 
> For normal pdf documents I am extracting text nicely using the standard process but when
> the pdf document is an image, I am failing to extract the text that is present in the image.
> 
> Can anyone give directions on this, please?
> 
> Thanks in advance.
> 
> Regards,
> Kishore Babu | Developer

