jackrabbit-users mailing list archives

From Sébastien Launay <sebastienlau...@gmail.com>
Subject Re: How can I access to the TextExtractor result?
Date Tue, 24 Nov 2009 18:56:49 GMT
I you  to get their hands dirty

2009/11/24 Paco Avila <monkiki@gmail.com>:
> Thanks, this is the expected answer :(
>
> Anyway, is there any way to detect a failed text extraction? I know
> I can check the log, but the failure is not associated with a file or path.
>
> Sometimes when I upload a document (Word, PDF, etc.) to my DMS built
> on Jackrabbit, it is not indexed. Office documents seem to be
> especially problematic due to their proprietary format. And the problem is
> that I don't know which document had problems in its text
> extraction, especially if I use extractorPoolSize > 1.
>
> Perhaps this question should be sent to the development list? I think
> this could be a very useful improvement to Jackrabbit.
>
> On Tue, Nov 24, 2009 at 5:50 PM, Jukka Zitting <jukka.zitting@gmail.com> wrote:
>> Hi,
>>
>> On Tue, Nov 24, 2009 at 5:37 PM, Paco Avila <monkiki@gmail.com> wrote:
>>> I wonder if I can access the text produced by the TextExtractor from a
>>> document file (like a PDF, for example)
>>
>> Jackrabbit doesn't store the extracted text anywhere; it is only used
>> to add the document to the inverted Lucene index.
>>
>> You can always use the text extractor directly to get the text
>> content. Check out http://lucene.apache.org/tika/ for more details
>> about the Tika toolkit that we nowadays use for text extraction.
>>
>> BR,
>>
>> Jukka Zitting
>>
>
>
>
> --
> Paco Avila
> OpenKM
> http://www.openkm.com
> http://www.guia-ubuntu.org
>
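
One way to get at the question above (which document failed extraction) is
to run the extractor yourself and record failures per path. The following
is only a minimal sketch under stated assumptions: the Extractor interface,
the TrackingExtractor class and its method names are hypothetical and are
not part of Jackrabbit or Tika. A real extractor (for example a call into
the Tika toolkit mentioned above) would have to be plugged in as the
delegate. A concurrent map is used because extractorPoolSize > 1 implies
several extraction threads.

```java
import java.io.InputStream;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical wrapper that remembers which paths failed text extraction. */
public class TrackingExtractor {

    /** Hypothetical stand-in for the real extractor; not a Jackrabbit/Tika type. */
    public interface Extractor {
        String extract(InputStream in) throws Exception;
    }

    private final Extractor delegate;

    // path -> last failure; thread-safe so several extraction threads can share it
    private final Map<String, Exception> failures = new ConcurrentHashMap<>();

    public TrackingExtractor(Extractor delegate) {
        this.delegate = delegate;
    }

    /**
     * Runs the delegate on the stream. On success returns the text and clears
     * any earlier failure for this path; on failure records the exception
     * under the path and returns an empty Optional.
     */
    public Optional<String> extract(String path, InputStream in) {
        try {
            String text = delegate.extract(in);
            failures.remove(path);
            return Optional.ofNullable(text);
        } catch (Exception e) {
            failures.put(path, e);
            return Optional.empty();
        }
    }

    /** Sorted snapshot of the paths whose most recent extraction failed. */
    public Set<String> failedPaths() {
        return new TreeSet<>(failures.keySet());
    }

    /** The exception recorded for a path, if its last extraction failed. */
    public Optional<Exception> failureFor(String path) {
        return Optional.ofNullable(failures.get(path));
    }
}
```

With something like this, the application could call extract(path, stream)
on upload and later report failedPaths() to the user, instead of searching
the log for an exception that carries no path.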



-- 
Sébastien Launay
