pdfbox-users mailing list archives

From <patrick.nich...@agencyport.com>
Subject Converting a form with internal font to image
Date Tue, 13 Mar 2012 17:11:57 GMT
I have a PDF:
http://dl.dropbox.com/u/28209500/Pages%20from%2011-12%20CA%20Apps.pdf

which appears to use an internal font encoding, so I am unable to extract the text with PDFBox.
(Actually, I have lots and lots of forms with this same problem.)
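
For triaging a large batch of forms like this, a rough check on the extracted text can flag which files are affected. The sketch below is only a heuristic under stated assumptions: the text has already been extracted into a String by whatever means, and the class name GibberishCheck, its methods, and the threshold are hypothetical, not part of PDFBox. It only catches output made up mostly of control or unassigned characters; a font that maps to valid but wrong letters would pass unnoticed.

```java
// Hypothetical helper, not part of PDFBox: scores how "readable" a piece of extracted text is.
final class GibberishCheck {
    private GibberishCheck() {}

    /**
     * Returns the fraction of code points that are letters, digits,
     * whitespace, or printable ASCII punctuation. Empty or null input scores 0.0.
     */
    static double readableRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int readable = 0;
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            boolean printableAscii = cp >= 0x21 && cp <= 0x7E;
            if (Character.isLetterOrDigit(cp) || Character.isWhitespace(cp) || printableAscii) {
                readable++;
            }
        }
        return (double) readable / total;
    }

    /** Flags text whose readable fraction falls below the given threshold (an assumed cutoff, e.g. 0.8). */
    static boolean looksGarbled(String text, double threshold) {
        return readableRatio(text) < threshold;
    }
}
```

Calling GibberishCheck.looksGarbled(extractedText, 0.8) on each form's extracted text would give a first-pass list of files worth inspecting by hand; the 0.8 cutoff is a guess and would need tuning against known-good forms.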

I thought that, as an alternative, I could use PDFBox to convert the form to an image and
then run OCR on that image. However, when I convert to an image with PDFBox, the text comes
out as gibberish as well.

So, am I correct in assuming that for PDFs that use internal fonts, there is no way to get
at the actual text using PDFBox, even as an image? If I know that it's impossible,
I can start looking at alternatives to PDFBox...

Patrick Nichols

