jackrabbit-users mailing list archives

From Paco Avila <monk...@gmail.com>
Subject Re: How can I access to the TextExtractor result?
Date Tue, 24 Nov 2009 18:12:16 GMT
Thanks, this is the expected answer :(

Anyway, is there any way to detect a failed text extraction? I know
I can check the log, but the failure is not associated with a file or path.

Sometimes when I upload a document (Word, PDF, etc.) to my DMS built
on Jackrabbit, it is not indexed. Office documents seem to be
especially problematic because of their proprietary format. The problem
is that I don't know which document had problems during text
extraction, especially if I use extractorPoolSize > 1.
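
To illustrate the kind of per-document reporting that is missing, here
is a minimal Java sketch. It assumes a hypothetical Extractor interface
and ExtractionTracker wrapper (neither is a Jackrabbit or Tika type). It
records each failed extraction against the path it was called with, in a
concurrent map so that it still works when several extraction threads
run at once.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical wrapper: not part of Jackrabbit or Tika.
class ExtractionTracker {

    /** Hypothetical stand-in for whatever extractor is actually in use. */
    interface Extractor {
        String extract(InputStream in) throws IOException;
    }

    private final Extractor delegate;

    // path -> failure description; thread-safe because several
    // extraction threads may report failures concurrently.
    private final Map<String, String> failures = new ConcurrentHashMap<>();

    ExtractionTracker(Extractor delegate) {
        this.delegate = delegate;
    }

    /**
     * Returns the extracted text, or null if extraction failed.
     * A failure is recorded together with the path of the document.
     */
    String extract(String path, InputStream in) {
        try {
            return delegate.extract(in);
        } catch (IOException | RuntimeException e) {
            failures.put(path, e.toString());
            return null;
        }
    }

    /** Snapshot of all failures seen so far, keyed by document path. */
    Map<String, String> failures() {
        return new HashMap<>(failures);
    }
}
```

With something like this, the application could list the keys of
failures() after an upload to see exactly which documents were not
indexed, instead of searching the log.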

Perhaps this question should be sent to the development list? I think
this could be a very useful improvement to Jackrabbit.

On Tue, Nov 24, 2009 at 5:50 PM, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> Hi,
> On Tue, Nov 24, 2009 at 5:37 PM, Paco Avila <monkiki@gmail.com> wrote:
>> I wonder if I can access the text produced by the TextExtractor from a
>> document file (like a PDF, for example)
> Jackrabbit doesn't store the extracted text anywhere, it is just used
> to add the document to the inverted Lucene index.
> You can always use the text extractor directly to get the text
> content. Check out http://lucene.apache.org/tika/ for more details
> about the Tika toolkit that we nowadays use for text extraction.
> BR,
> Jukka Zitting

Paco Avila
