jackrabbit-dev mailing list archives

From "Julio Castillo (JIRA)" <j...@apache.org>
Subject [jira] Created: (JCR-1567) IOException while extracting text from PDF
Date Thu, 01 May 2008 19:28:55 GMT
IOException while extracting text from PDF

                 Key: JCR-1567
                 URL: https://issues.apache.org/jira/browse/JCR-1567
             Project: Jackrabbit
          Issue Type: Bug
          Components: indexing
    Affects Versions: core 1.4.2
         Environment: Tomcat 6; JDK 1.6; Windows 2003;
            Reporter: Julio Castillo

While trying to upload a PDF document (which I can view fine in Acrobat Reader once it is
loaded), I get the following exception:

01.05.2008 12:24:44 *WARN * PdfTextExtractor: Failed to extract PDF text content (PdfTextExtractor.java, line 91)
java.io.IOException: Error: Expected an integer type, actual='%%EOF'
        at org.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1159)
        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:349)
        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:132)
        at org.apache.jackrabbit.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:69)
        at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
        at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:393)

I replaced the version of PDFBox (0.6.4) that is bundled with the Jackrabbit war file with
a more recent version (PDFBox 0.7.3 and FontBox 0.1), and text extraction worked fine. The
bundled versions should be upgraded.
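
When swapping the bundled jar as described above, it can help to confirm which jar the PDFBox parser class is actually loaded from. The following is a minimal sketch using only the Java standard library; the class name `JarLocator` and the method `locate` are hypothetical names chosen for illustration, and the default class name is taken from the stack trace above. It is not part of Jackrabbit or PDFBox.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical helper (not part of Jackrabbit or PDFBox): reports where a class was loaded from.
public class JarLocator {

    /**
     * Returns the code source location of the named class,
     * a placeholder string if the location is unknown (for example JDK classes),
     * or null if the class cannot be found by the calling class loader.
     */
    public static String locate(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            URL location = (src == null) ? null : src.getLocation();
            return (location == null) ? "(bootstrap or unknown source)" : location.toString();
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Default taken from the stack trace in this report; pass another class name as an argument.
        String name = args.length > 0 ? args[0] : "org.pdfbox.pdfparser.PDFParser";
        String where = locate(name);
        System.out.println(name + " -> " + (where == null ? "not on classpath" : where));
    }
}
```

Run standalone, this only inspects the classpath of the JVM it is started with. To see what the deployed web application loads, the same `locate` call would have to run inside that web application (for example from a test JSP or servlet), since Tomcat gives each web application its own class loader.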

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
