jackrabbit-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Updated: (JCR-1567) Upgrade to PDFBox 0.7.3
Date Fri, 30 May 2008 15:06:45 GMT

     [ https://issues.apache.org/jira/browse/JCR-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated JCR-1567:
-------------------------------

    Affects Version/s: 1.4  (was: core 1.4.2)
        Fix Version/s: 1.5
             Assignee: Jukka Zitting
           Issue Type: Improvement  (was: Bug)
              Summary: Upgrade to PDFBox 0.7.3  (was: IOException while extracting text from PDF)

Classifying this as an improvement, as the bug is in PDFBox and not in Jackrabbit. The Jackrabbit
improvement would be the upgrade to PDFBox 0.7.3.

PS. PDFBox is currently incubating to become an Apache project, so there's still life there.

> Upgrade to PDFBox 0.7.3
> -----------------------
>
>                 Key: JCR-1567
>                 URL: https://issues.apache.org/jira/browse/JCR-1567
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: indexing, jackrabbit-text-extractors
>    Affects Versions: 1.4
>         Environment: Tomcat 6; JDK 1.6; Windows 2003;
>            Reporter: Julio Castillo
>            Assignee: Jukka Zitting
>             Fix For: 1.5
>
>
> While trying to upload a PDF document (which I can view fine with Acrobat Reader once it is loaded), I get the following exception:
> 01.05.2008 12:24:44 *WARN * PdfTextExtractor: Failed to extract PDF text content (PdfTextExtractor.java, line 91)
> java.io.IOException: Error: Expected an integer type, actual='%%EOF'
>         at org.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1159)
>         at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:349)
>         at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:132)
>         at org.apache.jackrabbit.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:69)
>         at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
>         at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
>         at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:393)
>  ....
> I replaced the version of PDFBox (0.6.4) that is bundled with the Jackrabbit WAR file with a more recent version (0.7.3, with FontBox 0.1) and it worked fine. The bundled versions should be upgraded.
> On the other hand, this software appears to be inactive. A different package should probably be selected in the long run, but for now a simple upgrade will do the trick.
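
The WARN line in the report above suggests that the extractor catches the parser's IOException and carries on rather than failing the upload. A minimal Java sketch of that pattern follows. The Parser interface, the class name, and extractText here are hypothetical stand-ins for illustration only; they are not the PDFBox or Jackrabbit API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class SafeExtractSketch {

    /** Hypothetical stand-in for a PDF parser call; NOT the PDFBox API. */
    interface Parser {
        String parse(byte[] data) throws IOException;
    }

    /**
     * Runs the parser and falls back to empty text on IOException,
     * so that one unreadable document does not abort indexing.
     */
    static String extractText(Parser parser, byte[] data) {
        try {
            return parser.parse(data);
        } catch (IOException e) {
            // Log a warning and continue, as the WARN line in the report implies.
            System.err.println("WARN: Failed to extract PDF text content: "
                    + e.getMessage());
            return "";
        }
    }

    public static void main(String[] args) {
        // Simulates the failure reported with the older parser version.
        Parser failing = data -> {
            throw new IOException("Error: Expected an integer type, actual='%%EOF'");
        };
        // Simulates a parser that handles the same input.
        Parser working = data -> new String(data, StandardCharsets.UTF_8);

        byte[] input = "sample text".getBytes(StandardCharsets.UTF_8);
        System.out.println("failing parser -> [" + extractText(failing, input) + "]");
        System.out.println("working parser -> [" + extractText(working, input) + "]");
    }
}
```

With this shape, a parser bug costs only the full-text index entry for the affected document, which fits the report: the PDF was stored and viewable, and swapping in a newer parser was enough to restore extraction.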

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

