pdfbox-users mailing list archives

From Craig Strong <craigstr...@yahoo.com>
Subject Re: Extracting text from PDF with no embedded fonts
Date Fri, 14 Mar 2014 17:08:13 GMT
Hi, I used PDFBox 1.8.4. I went ahead and created an issue in JIRA and uploaded the PDF
file there, reusing most of the text from my original email.
 
Thanks,
Craig
 

________________________________
 From: Tilman Hausherr <THausherr@t-online.de>
To: users@pdfbox.apache.org 
Sent: Friday, March 14, 2014 2:52 AM
Subject: Re: Extracting text from PDF with no embedded fonts
  

Hi,

The best would be to create an issue with JIRA and upload the file there, if it isn't confidential.

Re "the latest", did you use a 1.8 version or a 2.0 version?

Tilman

On 10.03.2014 at 21:19, Craig Strong wrote:
> I have been using PDFBox to extract text from several different PDF files without
> problems. I use the latest PDFBox app with the ExtractText class. There is one PDF
> from which PDFBox (and iText) fails to extract any text, even though I can extract
> the text with Adobe Reader and also with pdftotext.exe (part of Xpdf). I don't want
> to rely on running pdftotext.exe from a PC, since this is part of an automated
> application. I think the error relates to an unknown font type and having to use the
> few fonts installed in the jar file. I tried running the API classes and forcing a
> font from a certain location, but I still got errors. I thought I loaded the font
> with the loadTTF method, but I don't know if that did anything with the font. I would
> really like to have this working straight from the ExtractText class anyway. I'm
> thinking I might have to build my own after putting a bunch of Windows fonts
> somewhere and changing a properties file, but I really don't know if that is the
> right direction to take, and I am new to PDFBox. Any ideas?
>
> Here are the errors I am getting. I tried this from both a Windows PC and our system,
> but I get the same errors. The section starting at processEncodedText repeats a few
> times, so I included only the first entries.
> Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
> WARNING: Substituting TrueType for unknown font subtype=
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> WARNING: java.lang.NullPointerException
> Throwable occurred: java.lang.NullPointerException
>          at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
>          at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)
>          at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)
>          at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)
>          at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)
>          at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
>          at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
>          at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>          at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>          at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>          at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>          at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
>          at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
>          at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
>          at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
>          at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>          at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
> WARNING: java.lang.NullPointerException
> Throwable occurred: java.lang.NullPointerException
>          at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
>          at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
>          at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>          at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>          at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>          at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>          at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
>          at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
>          at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
>          at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
>          at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>          at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> WARNING: java.lang.NullPointerException
> Throwable occurred: java.lang.NullPointerException
>          at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:364)
>          at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
>          at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>          at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>          at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>          at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>          at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
>          at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
>          at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
>          at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
>          at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>          at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
> 
> Thanks,
> Craig Strong