jackrabbit-dev mailing list archives

From "Julio Castillo (JIRA)" <j...@apache.org>
Subject [jira] Created: (JCR-1567) IOException while extracting text from PDF
Date Thu, 01 May 2008 19:28:55 GMT
IOException while extracting text from PDF
------------------------------------------

                 Key: JCR-1567
                 URL: https://issues.apache.org/jira/browse/JCR-1567
             Project: Jackrabbit
          Issue Type: Bug
          Components: indexing
    Affects Versions: core 1.4.2
         Environment: Tomcat 6; JDK 1.6; Windows 2003;
            Reporter: Julio Castillo


While trying to upload a PDF document (which I can view fine in Acrobat Reader once it is
loaded), I get the following exception:

01.05.2008 12:24:44 *WARN * PdfTextExtractor: Failed to extract PDF text content (PdfTextExtractor.java, line 91)
java.io.IOException: Error: Expected an integer type, actual='%%EOF'
        at org.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1159)
        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:349)
        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:132)
        at org.apache.jackrabbit.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:69)
        at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
        at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:393)
 ....

I replaced the version of pdfbox (0.6.4) that is bundled with the Jackrabbit war file with
a more recent version (pdfbox 0.7.3 and fontbox 0.1), and the extraction worked fine. The bundled
versions should be upgraded.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

