pdfbox-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Created: (PDFBOX-350) Null pointer exception during text extraction
Date Mon, 04 Aug 2008 17:54:44 GMT
Null pointer exception during text extraction
---------------------------------------------

                 Key: PDFBOX-350
                 URL: https://issues.apache.org/jira/browse/PDFBOX-350
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
            Reporter: Jukka Zitting


[Issue from SourceForge]
http://sourceforge.net/tracker/index.php?func=detail&aid=1934566&group_id=78314&atid=552832

Parsing the following document from the US government (SSA) website throws a NullPointerException:
http://www.ssa.gov/multilanguage/Arabic/10101-AR.pdf

Exception in thread "main" java.lang.NullPointerException
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:360)
    at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.ExtractText.main(ExtractText.java:244)

This is caused by an unchecked dereference of c in the call c.equals( " " ) in line 377 of
PDFStreamEngine.java, where c can be null.
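
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the PDFBox source: the class and method names are hypothetical, and c simply stands in for a decoded character string that may be null. It also shows a constant-first comparison as one possible alternative to an explicit null check.

```java
public class SpaceCheckSketch {
    // Mirrors the unchecked pattern from the report: throws
    // NullPointerException when c is null.
    static boolean isSpaceUnchecked(String c) {
        return c.equals(" ");
    }

    // Constant-first comparison: String.equals(null) returns false,
    // so no separate null guard is needed.
    static boolean isSpaceNullSafe(String c) {
        return " ".equals(c);
    }

    public static void main(String[] args) {
        System.out.println(isSpaceNullSafe(" "));
        System.out.println(isSpaceNullSafe(null));
        try {
            isSpaceUnchecked(null);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException from the unchecked call");
        }
    }
}
```

Either form only avoids the exception; it does not explain why c is null in the first place.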

Changing this line to
if( (string[i] == 0x20) && c != null && c.equals( " " ) )

eliminates the null pointer dereference, but the output then contains many
embedded nulls, as in this excerpt:

يف اوشاع اذإ ماعطلا عباوطل نيلهؤملا
بناجلأا ،يلي اميف �������null���
ماعطلا عباوط جمانرب دعاسي

In one case the word "null" is printed several dozen times.
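
One plausible source of the literal word "null", offered as an assumption rather than a confirmed diagnosis of PDFBox: in Java, appending a null String reference to a StringBuilder writes the four characters "null". The sketch below uses hypothetical names (NullAppendSketch, joinNaive, joinSkippingNulls) and a made-up input array to show the effect and one way to suppress it.

```java
public class NullAppendSketch {
    // Appends every element as-is; a null element is written
    // as the text "null".
    static String joinNaive(String[] decoded) {
        StringBuilder out = new StringBuilder();
        for (String c : decoded) {
            out.append(c);
        }
        return out.toString();
    }

    // Skips elements that could not be decoded (null) instead of
    // writing them to the output.
    static String joinSkippingNulls(String[] decoded) {
        StringBuilder out = new StringBuilder();
        for (String c : decoded) {
            if (c != null) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical decoded characters; null marks a failed lookup.
        String[] decoded = { "a", null, "b" };
        System.out.println(joinNaive(decoded));
        System.out.println(joinSkippingNulls(decoded));
    }
}
```

Skipping nulls hides the symptom but drops the affected characters, so the underlying decoding problem would still need to be found.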

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

