jackrabbit-users mailing list archives

From Sébastien Launay <sebastienlau...@gmail.com>
Subject Re: Memory issues with jackrabbit/lucene
Date Tue, 29 Sep 2009 13:29:19 GMT
On 29/09/2009 14:39, Muguet Bradbury wrote:
> Just as a reminder, we use jackrabbit 1.4.  I'm not explicitly using the text-extractors.
You do not configure custom text extractors, but I think that by
default you get the DefaultTextExtractor [1], which declares the
PlainTextExtractor [2] and the XMLTextExtractor [3].

AFAIK, with this configuration only nodes that have a jcr:data
property and a jcr:mimeType property with one of the following
values will have their content indexed:
- text/plain using PlainTextExtractor.
- text/xml using XMLTextExtractor.
- application/xml using XMLTextExtractor.
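
The mapping above can be written down as a small lookup. This is only a sketch of the list as I understand it, not code taken from Jackrabbit: the class DefaultExtractorLookup and the method extractorFor are hypothetical names, and the table is transcribed from this message rather than read from the DefaultTextExtractor.

```java
import java.util.Map;
import java.util.Optional;

public class DefaultExtractorLookup {
    // MIME type -> extractor name, transcribed from the list above.
    // Assumption: this mirrors the default setup; it is not read from Jackrabbit.
    private static final Map<String, String> EXTRACTORS = Map.of(
        "text/plain", "PlainTextExtractor",
        "text/xml", "XMLTextExtractor",
        "application/xml", "XMLTextExtractor");

    // Returns the extractor that would handle the given jcr:mimeType value,
    // or an empty Optional when no extractor in the table matches.
    public static Optional<String> extractorFor(String mimeType) {
        if (mimeType == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(EXTRACTORS.get(mimeType));
    }

    public static void main(String[] args) {
        String[] samples = {"text/plain", "application/xml", "application/rtf"};
        for (String type : samples) {
            System.out.println(type + " -> "
                + extractorFor(type).orElse("(no text extraction)"));
        }
    }
}
```

Under this assumption, a node whose jcr:mimeType is something else (for example application/rtf) would not go through text extraction at all.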

> Our customers are alleviating the memory problem by restarting the servers daily.
> The documents we store are numerous (thousands daily) and vary in size.  They are news
> articles (xml/html) and reports (rtf) and are all stored as binary content (base64 encoded).
> We also store some attributes about these articles that are in string format.  We delete
> thousands of news articles per day when reports are finalized.  We do not need to be able
> to search the content of these articles - but I assume they are being indexed because we have
> specified SearchIndex elements in our repository xml.
> Am I correct here?
That's right: having the SearchIndex elements (AFAIK the main one
is used for versions indexing) will index nodes and therefore binary

Maybe the memory problem comes from the base64 encoding: the
extraction process assumes the content is in raw form, and tokenizing
a base64 stream may, in my understanding, create only one big
token if there are no linefeeds...
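
To illustrate the single-token effect, here is a minimal sketch. It uses a plain whitespace split as a stand-in tokenizer, which is an assumption: it is not the Lucene analyzer that Jackrabbit uses, and a real analyzer may split the stream differently. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64TokenDemo {
    // Stand-in tokenizer: counts whitespace-separated tokens.
    // Assumption: this only approximates what a real analyzer would do.
    public static int countWhitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Encodes text with the basic Base64 encoder, which emits no line separators.
    public static String encodeWithoutLinefeeds(String raw) {
        return Base64.getEncoder()
            .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String article = "Markets rallied today as reports were finalized.";
        String encoded = encodeWithoutLinefeeds(article);

        System.out.println("raw tokens:    " + countWhitespaceTokens(article));
        System.out.println("base64 tokens: " + countWhitespaceTokens(encoded));
        System.out.println("base64 length: " + encoded.length());
    }
}
```

With a tokenizer like this one, the encoded form is a single token whose length grows with the document, whereas the raw text splits into many short tokens.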

Another solution is to use the following parameter, which limits
the number of simultaneous extractions by running them in a pool
of threads:

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${rep.home}/repository/index" />
  <param name="extractorPoolSize" value="4" />
</SearchIndex>
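
The general idea of capping concurrent work with a fixed-size thread pool can be sketched as follows. This is not how Jackrabbit implements extractorPoolSize internally; the class BoundedExtractionDemo, the method maxConcurrent and the sleep that stands in for extraction work are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedExtractionDemo {
    // Runs `jobs` simulated extractions on a pool of `poolSize` threads and
    // returns the highest number of jobs observed running at the same time.
    public static int maxConcurrent(int poolSize, int jobs) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < jobs; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(20); // placeholder for real extraction work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every job to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrent jobs: " + maxConcurrent(4, 20));
    }
}
```

However many jobs are submitted, the pool never runs more than poolSize of them at once; the remaining jobs wait in the queue, which bounds how many documents are held in memory for extraction at any moment.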

Setting the log level to info for the category
org.apache.jackrabbit.extractor may give you more useful information.


Sébastien Launay
