jackrabbit-dev mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: [jr3] Search index in content
Date Thu, 18 Feb 2010 09:23:51 GMT
On Thu, Feb 18, 2010 at 8:39 AM, Thomas Müller <thomas.mueller@day.com> wrote:

> The fulltext index is (potentially) slow, especially fulltext
> extraction. Therefore, fulltext index should be done asynchronously if

Would this be in line with the spec?

> it takes too long. Also, in a clustered environment, at least text
> extraction should only be done in one cluster node. I would still use
> Apache Tika and Apache Lucene for this.

PDF extraction in particular can kill the performance of an entire
cluster. In our structure a PDF can be part of a document, so it has
to be node-scope indexed every time the document is saved again. To
avoid repeating the extraction, we store an extracted text version of
the PDF as a binary (so that it goes into the DataStore) and index
this extracted version. Now only one node in the cluster does the
extraction, and only one user is blocked. The other nodes just index
the extracted text version, which is quite fast. Not sure if we should
have this kind of option part of
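
To make the idea concrete, here is a minimal sketch in plain Java. It
is not Jackrabbit code: the class ExtractedTextCache, its methods, and
the in-memory map standing in for the DataStore are all hypothetical
names chosen for illustration. It only shows the principle that the
expensive extraction runs once per binary and every later indexing
pass reuses the stored text.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: extract text once per binary, then index the stored text. */
class ExtractedTextCache {
    // Assumption: stands in for a shared, content-addressed store such as the DataStore.
    private final Map<String, String> extractedByHash = new ConcurrentHashMap<>();
    // Assumption: stands in for the expensive extractor (for example a PDF parser).
    private final Function<byte[], String> extractor;
    // Counts how often the expensive extractor actually ran.
    private final AtomicInteger extractions = new AtomicInteger();

    ExtractedTextCache(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    /** Returns the extracted text; the extractor runs only on the first request per content. */
    String textFor(byte[] binary) {
        return extractedByHash.computeIfAbsent(sha256(binary), key -> {
            extractions.incrementAndGet();
            return extractor.apply(binary);
        });
    }

    int extractionCount() {
        return extractions.get();
    }

    // Content hash used as the storage key, so identical binaries share one entry.
    private static String sha256(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling textFor twice with the same bytes runs the extractor once; the
second call only reads the stored text. In a real cluster the map
would have to be replaced by storage that all nodes can read, which
this single-JVM sketch does not attempt.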

regards Ard

> Regards,
> Thomas
