jackrabbit-dev mailing list archives

From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: LazyTextExtractorField and background text extraction
Date Thu, 16 Jul 2009 09:51:03 GMT
Hi,

On Thu, Jul 16, 2009 at 11:32, Jukka Zitting<jukka.zitting@gmail.com> wrote:
> On Thu, Jul 16, 2009 at 11:04 AM, Marcel
> Reutegger<marcel.reutegger@gmx.net> wrote:
>> hmm, even if the conversion from reader to string is done in a
>> separate thread as part of the extractor job, there remains the issue
>> when the reader is used as is.
>
> As far as I can tell from the code, this is currently not the case as
> all the binary values get wrapped into LazyTextExtractorFields.

that's correct. I meant it would be a general problem even if we
changed the way LazyTextExtractorField works, i.e. if it returned
the reader in case the content is not stored in the index.

>> we'd have to change the way how the indexer finds out whether the
>> extractor times out.
>
> Would it help if we added an unlimited buffering mechanism (backed by
> temporary files as needed) to the Readers so that if the indexer gets
> blocked extracting text from one document, all the other pending
> documents can automatically continue text extraction in parallel? This
> might cause occasional blocking in the indexer, but on the average it
> should do about as well as maintaining an explicit indexing queue.

I'm not sure I understand that correctly. with the current design
multiple nodes are already indexed in parallel. but the index update
as a whole will still be blocked, waiting for *all* nodes to be
indexed.

the indexing queue is meant to take over long running text
extractions and do that work outside of the index update. instead of
indexing the real content, the timed out text extracts are replaced
with dummy values. a new index update is done when the extraction has
finished (this is currently detected by an available reader from the )
with the complete text extract.

there is a configuration parameter extractorTimeout which limits the
amount of time spent in extracting text (or waiting for that to
happen). I think it must be possible to configure the repository to
never block on text extractions. it is vital because jackrabbit
currently only supports one writing transaction at a time and the
indexing is part of that transaction.
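
to illustrate the timeout-plus-queue idea described above, here is a minimal sketch. it is not jackrabbit's actual code: the class TimedExtractionSketch, its methods, and the DUMMY placeholder are hypothetical names, and the timeoutMillis field only plays the role of extractorTimeout. the sketch waits a bounded time for an extraction, falls back to a dummy value on timeout, and lets a later index update pick up the finished extract.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch; these are not Jackrabbit classes. */
class TimedExtractionSketch {
    /** Placeholder that is indexed when the extraction takes too long. */
    static final String DUMMY = "";

    private final ExecutorService pool;
    /** Plays the role of the extractorTimeout parameter (assumption). */
    private final long timeoutMillis;
    /** Node id -> extraction still running; stands in for the indexing queue. */
    private final Map<String, Future<String>> pending = new ConcurrentHashMap<>();

    TimedExtractionSketch(ExecutorService pool, long timeoutMillis) {
        this.pool = pool;
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Runs the extractor in the pool and waits at most timeoutMillis.
     * On timeout the job keeps running, is remembered in the queue,
     * and the dummy value is returned so the index update can proceed.
     */
    String extractOrDefer(String nodeId, Callable<String> extractor)
            throws InterruptedException, ExecutionException {
        Future<String> job = pool.submit(extractor);
        try {
            return job.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.put(nodeId, job);
            return DUMMY;
        }
    }

    /**
     * Collects extractions that have finished since they timed out and
     * removes them from the queue; the caller would re-index these nodes.
     */
    Map<String, String> drainFinished()
            throws InterruptedException, ExecutionException {
        Map<String, String> done = new HashMap<>();
        Iterator<Map.Entry<String, Future<String>>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Future<String>> entry = it.next();
            if (entry.getValue().isDone()) {
                done.put(entry.getKey(), entry.getValue().get());
                it.remove();
            }
        }
        return done;
    }
}
```

with a timeout of 0 the writer would in effect never wait for a slow extraction in this sketch, at the cost of more nodes being indexed twice (once with the dummy value, once with the real text).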

regards
 marcel

> In fact if we did this in Tika, we could avoid the extra buffering
> entirely for things like plain text documents and other formats where
> the parsing overhead is negligible.
>
> BR,
>
> Jukka Zitting
>
