jackrabbit-dev mailing list archives

From Christoph Kiehl <christ...@sulu3000.de>
Subject Re: Commented: (JCR-791) Cache BitSet per IndexReader in MatchAllScorer
Date Fri, 16 Mar 2007 16:08:37 GMT
Marcel Reutegger (JIRA) wrote:

> Here's what I've done so far:
> - Introduced a MultiIndexReader interface that allows to access the sub index readers.
> - CachingMultiReader and SearchIndex.CombinedIndexReader now implement MultiIndexReader
> - Created a MultiScorer which spans multiple sub scorers and combines them. The
> MultiScorer exposes the sub scorers as if there were just a single scorer.
> - Changed MatchAllWeight to create individual scorers for each sub IndexReader
> contained in a MultiIndexReader and finally combine them into a MultiScorer.
> - Introduced a BitSet cache in MatchAllScorer

Great. Thanks a lot!
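The MultiScorer idea described above could look roughly like the following simplified, Lucene-free sketch (the class and helper names here are illustrative assumptions, not the actual patch): each sub scorer produces doc ids local to its index segment, and the combined scorer rebases them by the segment's start offset so they appear as one continuous stream.

```java
import java.util.Arrays;

// Simplified sketch of combining per-segment scorers: each sub scorer matches
// doc ids local to its segment; the combined view rebases them by the segment
// start offset and exposes them as a single stream of global doc ids.
public class MultiScorerSketch {

    // Stand-in for a per-segment scorer: just an array of matching local doc ids.
    static class SubScorer {
        final int[] docs;
        SubScorer(int... docs) { this.docs = docs; }
    }

    // Combine sub scorers into one global doc id stream.
    // starts[i] is the global doc id at which segment i begins.
    static int[] combine(SubScorer[] subs, int[] starts) {
        int total = 0;
        for (SubScorer s : subs) total += s.docs.length;
        int[] result = new int[total];
        int pos = 0;
        for (int i = 0; i < subs.length; i++) {
            for (int doc : subs[i].docs) {
                result[pos++] = doc + starts[i]; // rebase local id to global id
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SubScorer[] subs = {
            new SubScorer(0, 2),   // segment starting at global doc 0
            new SubScorer(1, 3)    // segment starting at global doc 5
        };
        int[] starts = {0, 5};
        System.out.println(Arrays.toString(combine(subs, starts)));
        // global ids: [0, 2, 6, 8]
    }
}
```

Because each segment is consumed independently, no merge sort across segments is needed, which matches the explanation of the speedup given below.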

> I then conducted the following tests:
> Setup:
> - 50'000 nodes
> - resultFetchSize: 50
> - respectDocumentOrder: false
> 100 queries: //element(*, nt:unstructured)[@foo]
>  (only size of NodeIterator is read, no node access)
> Results:
> 1) with jackrabbit 1.2.3:
>     82078 ms
> 2) with MatchAllScorer per index segment
>   combined with MultiScorer without caching:
>     10297 ms
> 3) with MatchAllScorer per index segment
>   combined with MultiScorer with caching:
>      6156 ms
> My conclusion is that the lucene MultiTermDocs implementation adds significant cost
> when a single MatchAllScorer is used in test scenario 1). And it actually makes sense.
> If a single MatchAllScorer is used, lucene has to merge sort the @foo terms of several
> index segments, while in the test scenarios 2) and 3) no merge sort is needed for the
> @foo terms.
> With the changes the query performance seems good enough even without caching.
> I'm tempted to only check in the changes without caching because the additional
> performance improvement with caching does not seem to warrant the memory consumption
> of the cache: 2) decreases the query time compared to the current implementation by
> 87% while 3) decreases query time by 92%.

The effect of caching should increase if you use queries which test an attribute 
more than once, like:

//element(*, nt:unstructured)[@foo!='1' or @foo!='2' or @foo!='3']
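A per-reader BitSet cache along these lines could be sketched as follows (an illustration, not the actual patch; the map key and cache policy are assumptions). Keying a WeakHashMap on the segment reader lets entries be reclaimed automatically once the reader is no longer referenced, e.g. after an index merge:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of a per-reader BitSet cache as a MatchAllScorer-like class might use it.
// A real implementation would key on (reader, field); this sketch keys on the
// reader alone to keep it short.
public class BitSetCacheSketch {

    private final Map<Object, BitSet> cache = new WeakHashMap<Object, BitSet>();

    // Returns the cached bits for this reader, computing them on a miss.
    synchronized BitSet getBits(Object reader, String field) {
        BitSet bits = cache.get(reader);
        if (bits == null) {
            bits = computeBits(reader, field); // expensive: scans the term docs
            cache.put(reader, bits);
        }
        return bits;
    }

    // Placeholder for the real work of marking every doc that has the field.
    BitSet computeBits(Object reader, String field) {
        BitSet bits = new BitSet();
        bits.set(0, 3); // pretend docs 0..2 carry the field
        return bits;
    }
}
```

With such a cache, a query that tests @foo three times pays the term scan once per reader instead of three times.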

Maybe we can add a configuration option to SearchIndex which allows enabling 
caching? This way one can choose whether the focus is on memory or on processing 
time. We have a situation, for example, where a lot of memory is available but 
processing time is the bottleneck.
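Since Jackrabbit maps &lt;param&gt; elements in workspace.xml onto bean-style setters of the SearchIndex class, such an option would only need a property pair like the following sketch (the name cacheMatchAllBitsets is made up here for illustration):

```java
// Hypothetical bean-style switch; Jackrabbit sets <param name="..." value="..."/>
// entries from the SearchHandler configuration onto setters like this via reflection.
public class SearchIndexConfigSketch {

    private boolean cacheMatchAllBitsets = false; // off by default: favor memory

    public boolean getCacheMatchAllBitsets() {
        return cacheMatchAllBitsets;
    }

    public void setCacheMatchAllBitsets(boolean cacheMatchAllBitsets) {
        this.cacheMatchAllBitsets = cacheMatchAllBitsets;
    }
}
```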

Would you mind sharing a patch for the caching you implemented? Do you maybe 
even have a testcase which generates this test repository? I could do some 
further tests here.

