jackrabbit-dev mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: Improve indexing performance wrt Aggregates
Date Thu, 17 Sep 2009 09:17:36 GMT
On Thu, Sep 17, 2009 at 10:58 AM, Marcel Reutegger
<marcel.reutegger@gmx.net> wrote:
> Hi,
> In general I think such a cache makes sense, but we have to be careful
> when reusing Document instances. They may contain Readers that have a
> state and once consumed will probably be closed and throw an exception
> on the next read.
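
[A minimal illustration of the caveat above: a Reader has state, so a cached Document holding a Reader-backed field cannot be consumed twice. This is a standalone sketch using a plain StringReader, not Jackrabbit or Lucene code.]

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Once the first indexing pass drains and closes the Reader, any later
// consumer of the same (cached) instance fails.
public class ReaderReuse {
    public static void main(String[] args) throws IOException {
        Reader r = new StringReader("extracted pdf text");
        char[] buf = new char[64];
        while (r.read(buf) != -1) {
            // first consumer (e.g. the first addDocument call) drains it
        }
        r.close();
        try {
            r.read(buf); // second consumer: StringReader is already closed
        } catch (IOException e) {
            System.out.println("reuse failed: " + e.getMessage());
        }
    }
}
```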

That is a good point! Then it might be better not to treat it as a
cache (an LRUMap kind of thing), but to have the MultiIndex, when it
starts a new transaction, thus at

long transactionId = nextTransactionId++;

reset an alreadyIndexed map in the SearchIndex: Map<String,
WeakReference<Document>> alreadyIndexed

When the MultiIndex finishes its 'synchronized void update(Iterator
remove, Iterator add)', we set alreadyIndexed to null, or clear it.
AFAIK this will never interfere with existing Readers, and after the
update we flush the alreadyIndexed map, making sure no references to
Lucene Documents are kept. I suggest using WeakReferences in the
alreadyIndexed map in case the update is really large, resulting in
many, possibly large, Lucene Documents.
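[A minimal sketch of that lifecycle. The class and method names below are placeholders for illustration, not existing Jackrabbit API, and Document stands in for org.apache.lucene.document.Document.]

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

class TransactionScopedDocuments {
    // Stand-in for org.apache.lucene.document.Document.
    static class Document {}

    private Map<String, WeakReference<Document>> alreadyIndexed;

    // Called when MultiIndex starts a new transaction.
    void beginTransaction() {
        alreadyIndexed = new HashMap<String, WeakReference<Document>>();
    }

    // Returns the Document created earlier in this transaction, or null.
    Document getIfAlreadyIndexed(String uuid) {
        if (alreadyIndexed == null) {
            return null;
        }
        WeakReference<Document> ref = alreadyIndexed.get(uuid);
        return ref == null ? null : ref.get();
    }

    void rememberIndexed(String uuid, Document doc) {
        if (alreadyIndexed != null) {
            alreadyIndexed.put(uuid, new WeakReference<Document>(doc));
        }
    }

    // Called when MultiIndex.update(...) finishes: drop all references so
    // no Lucene Documents are kept alive across transactions.
    void endTransaction() {
        alreadyIndexed = null;
    }
}
```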

If people like the idea, I'll work on a suggested patch. (I tested
indexing the JSR-170 spec: it takes around 2 seconds on my laptop, and
the JSR-283 spec even 3+ seconds. You do want to reuse this in
aggregates, making sure that within one update the indexing is not
done multiple times.)

Regards Ard

> regards
>  marcel
> On Wed, Sep 16, 2009 at 14:56, Ard Schrijvers <a.schrijvers@onehippo.com> wrote:
>> Hello,
>> I want to avoid recreating a lucene Document when using aggregates
>> over and over. This particularly holds when using aggregates for
>> binary data like a pdf. I think using aggregates also for binary data
>> is a good use case. But indexing and extracting a large PDF is CPU
>> intensive, and this is currently done multiple times when using
>> aggregates. For non binary data the same holds, but the performance
>> penalty is lower.
>> I would like to add a really simple small cache into the SearchIndex:
>> Something like:
>> private final Map<String, WeakReference<Document>> lastCreatedDocs =
>> new LRUMap(500);
>> (WeakReferences used in case the Lucene Document is really large)
>> Now, in
>> protected Document  createDocument(NodeState node, NamespaceMappings
>> nsMappings, IndexFormatVersion indexFormatVersion)
>> we start with:
>> WeakReference<Document> refDoc =
>> lastCreatedDocs.get(node.getNodeId().getUUID().toString());
>>        Document doc;
>>        if(refDoc != null && (doc = refDoc.get()) != null) {
>>            // We already created a Document for this NodeState within
>> the current transaction. Return it directly.
>>            return doc;
>>        }
>> and before the return statement:
>> lastCreatedDocs.put(node.getNodeId().getUUID().toString(), new
>> WeakReference<Document>(doc));
>> Now, when you have aggregates, then in
>> mergeAggregatedNodeIndexes(node, doc);
>> you'll get a cached version of the aggregated node's Document back.
>> This way, within a *single* indexing transaction, all nodes are
>> Lucene indexed once, and not, possibly, many times.
>> Obviously, there is one catch: on the next indexing transaction, we
>> need to flush the lastCreatedDocs cache, as a node might have been
>> changed. If we add a method to the QueryHandler interface, something
>> like
>> void flush()
>> and we call this whenever we are going to index a new set of nodes, I
>> think we should be fine. AFAICS if in MultiIndex
>> synchronized void update(Iterator remove, Iterator add) throws IOException {
>> we do a
>> handler.flush()
>> we are there.
>> What do others think? If people like it, I'll create a patch to test.
>> Regards Ard
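
[A consolidated sketch of the cached createDocument lookup described in the quoted proposal. createDocumentInternal is a placeholder for the existing, expensive document-building work, and the Document/NodeState types are stand-ins, not Jackrabbit's or Lucene's own classes.]

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

class CachingCreateDocument {
    static class Document {}
    static class NodeState {
        private final String uuid;
        NodeState(String uuid) { this.uuid = uuid; }
        String getUUID() { return uuid; }
    }

    private final Map<String, WeakReference<Document>> lastCreatedDocs =
            new HashMap<String, WeakReference<Document>>();

    Document createDocument(NodeState node) {
        String uuid = node.getUUID();
        WeakReference<Document> refDoc = lastCreatedDocs.get(uuid);
        Document doc;
        if (refDoc != null && (doc = refDoc.get()) != null) {
            // Already created for this NodeState within the current
            // indexing transaction: return it directly.
            return doc;
        }
        doc = createDocumentInternal(node);
        lastCreatedDocs.put(uuid, new WeakReference<Document>(doc));
        return doc;
    }

    // Placeholder for the expensive work (text extraction, field creation).
    Document createDocumentInternal(NodeState node) {
        return new Document();
    }

    // Corresponds to the proposed QueryHandler.flush(), called from
    // MultiIndex.update(...) before indexing a new set of nodes.
    void flush() {
        lastCreatedDocs.clear();
    }
}
```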
