pdfbox-users mailing list archives

From Craig Strong <craigstr...@yahoo.com>
Subject Extracting text from PDF with no embedded fonts
Date Mon, 10 Mar 2014 20:19:03 GMT
I have been using PDFBox to extract text from several different PDF files without problems,
using the latest PDFBox app with the ExtractText class.  There is one PDF from which PDFBox
(and iText) fails to extract any text, even though I can extract the text with Adobe Reader
and with pdftotext.exe, which is part of XPdf.  I don't want to rely on running pdftotext.exe
from a PC, since this is part of an automated application.

I think the error relates to an unknown font type and having to use the few fonts installed
in the jar file.  I tried calling the API classes directly and forcing a font from a certain
location, but I still got errors.  I thought I had loaded the font with the loadTTF method,
but I don't know whether that had any effect on the font.  I would really like to have this
working straight from the ExtractText class anyway.  I'm thinking I might have to build my
own version after putting a bunch of Windows fonts somewhere and changing a properties file,
but I am new to PDFBox and really don't know if that is the right direction.  Any ideas?

Here are the errors I am getting.  I tried this from both a Windows PC and our system, and
I get the same errors.  The section starting at processEncodedText repeats a few times, so
I have included only the first entries.
 
Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
WARNING: Substituting TrueType for unknown font subtype=
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)
        at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)
        at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
        at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
        at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
        at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
        at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
        at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:364)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
        at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
        at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)

Thanks,
Craig Strong