jackrabbit-users mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: Something funny in the text extraction of PDFs [SEC=UNCLASSIFIED]
Date Tue, 19 Oct 2010 07:47:14 GMT
Hi,

On Tue, Oct 19, 2010 at 9:33 AM, Ard Schrijvers
<a.schrijvers@onehippo.com> wrote:
> On Tue, Oct 19, 2010 at 7:11 AM,  <Ross.Dyson@ipaustralia.gov.au> wrote:
>> org.apache.pdfbox.cos.COSDocument - java.lang.ClassCastException
>> Oct 19 13:13:40 localhost java.lang.ClassCastException
>
> Think you might better want to (cross) post this question to pdfbox as
> Jackrabbit uses pdfbox for pdf extraction, and also the exception
> points to pdfbox,

Exactly. PDFBox is having problems extracting text from some of your
documents. In such cases Jackrabbit simply logs the problem as a
warning (so you know about it) and indexes the document as if it were
empty.
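
To illustrate the behaviour described above, here is a minimal sketch of
the "log a warning and index as empty" pattern. It is not Jackrabbit's
actual code: the SafeExtraction class, the Extractor interface and the
extractOrEmpty method are hypothetical names made up for this example,
and the extractor stands in for whatever library (PDFBox in this case)
does the real parsing.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class SafeExtraction {
    private static final Logger LOG =
            Logger.getLogger(SafeExtraction.class.getName());

    /** Hypothetical stand-in for a text extractor; not a Jackrabbit or PDFBox type. */
    public interface Extractor {
        String extract(byte[] document) throws Exception;
    }

    /**
     * Returns the extracted text, or an empty string if extraction fails.
     * The failure is logged as a warning so that it stays visible.
     */
    public static String extractOrEmpty(Extractor extractor, byte[] document) {
        try {
            return extractor.extract(document);
        } catch (Exception e) {
            // Passing the exception to the logger records its full stack trace.
            LOG.log(Level.WARNING,
                    "Text extraction failed; indexing document as empty", e);
            return "";
        }
    }
}
```

With an extractor that throws a ClassCastException, extractOrEmpty
returns "" and the stack trace ends up in the log, so the document is
still indexed but its text content is not searchable.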

In general these warnings aren't too harmful, but it would be good if
you could post the full stack trace to the PDFBox issue tracker at
https://issues.apache.org/jira/browse/PDFBOX, preferably with an
example PDF that causes the problem. That way the PDFBox team will
have a chance to fix the problem, and sooner or later Jackrabbit will
also be able to index your documents better.

BR,

Jukka Zitting
