jackrabbit-dev mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject Search performance : MultiIndex
Date Fri, 26 Oct 2007 07:10:33 GMT
Hello,

I have some problems with the performance of searches in Jackrabbit. I
have a simple search, like: give me all nodes where (prop1=a +prop2=b
+prop3=c +prop4=d). For Lucene this is obviously an extremely simple
query. Running it on a Lucene index with millions of docs, where the
number of hits is small (< 100), results in queries executed in a couple
of ms.

With this kind of query in Jackrabbit, on for example 100,000 nodes,
repeating the search described above results in *slow* responses (a
couple of hundred ms for only 100,000 nodes). I asked on the Lucene list
what impact a MultiSearcher has on performance compared to a single
index (I know we use a CombinedIndexReader and a normal IndexSearcher,
though I am quite convinced the problem stays the same). I got only one
answer: a search over, say, 100 indexes would take 100 times longer
(which is roughly what I am experiencing when the number of actual hits
is small).
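
To make the "100 indexes, 100 times longer" claim concrete, here is a
toy model in plain Java. It does not use Lucene or Jackrabbit, and the
names ToySegment and ToyMultiSearch are made up for this illustration.
It only assumes that every index pays one term dictionary lookup per
query term, whether or not it contains a hit. Real Lucene behaviour is
more subtle, so treat this as a sketch of the fixed per-index cost only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical toy index: maps "field=value" terms to sorted doc ids. */
class ToySegment {
    final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    void add(int docId, String... terms) {
        for (String t : terms) {
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
        }
    }
}

class ToyMultiSearch {
    /** Number of term dictionary lookups done by the last search. */
    static int lookups;

    /** AND query over all segments; each segment pays one lookup per term. */
    static List<Integer> searchAll(List<ToySegment> segments, String... terms) {
        lookups = 0;
        List<Integer> hits = new ArrayList<>();
        for (ToySegment seg : segments) {
            TreeSet<Integer> result = null;
            for (String t : terms) {
                lookups++; // assumed fixed cost, paid even when there is no hit
                TreeSet<Integer> docs = seg.postings.getOrDefault(t, new TreeSet<>());
                if (result == null) {
                    result = new TreeSet<>(docs);
                } else {
                    result.retainAll(docs); // intersect the posting lists
                }
            }
            if (result != null) {
                hits.addAll(result);
            }
        }
        return hits;
    }
}
```

In this model a four-term query over 100 segments does 400 lookups,
while the same documents merged into one segment need only 4, although
both return the same hits. When the number of hits is small, that fixed
cost is most of the work.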

I wrote a separate program to do some testing, such as merging the
Jackrabbit indexes into one single index. Then my queries are fast. The
original reason for multiple indexes is, I think, to be able to keep
more IndexReaders open and cache the results, and to have easier/faster
incremental updating, right? Also see [1]. The thread in [2] between
Christoph and Marcel might also be closely related to this (I did not
test RangeQueries, but intuitively they will suffer even more from
multiple indexes, because I think each index has to expand the
RangeQuery separately). The problem with the slow
DescendantSelfAxisWeight won't be solved by this, though I did make some
changes in our code to be able to tell quickly whether a node is a child
of some node or not (if people are interested: I have been thinking
about this one, and it is a trade-off between fast renaming of a node in
Jackrabbit and fast searching for child nodes (write versus read)).

Before I try to see what can be changed: do other people experience the
same thing?
Might it be something that was faster at the time of Lucene 1.9, but is
now perhaps outdated?

I also found some reports that file system access for multiple indexes
is slower, because head movements during reading might be much larger
compared to a single index (though how the file system cache is managed
might of course be platform dependent).

To start with, I have tried to keep the number of indexes created as
small as possible by tuning minMergeDocs, volatileIdleTime,
maxMergeDocs and mergeFactor. However, whenever my number of
documents/nodes grows (though to only 100,000 nodes), my number of
indexes grows as well.

I think the idea of separate indexes is perfectly valid, only I want to
reduce the number of indexes to no more than, for example, 10. Adding
each VolatileIndex, when persisting it, to an already persistent index
until that index contains for example 100,000 docs, and then, when there
are 10 of them, merging them all and starting to create indexes of
1,000,000 docs until there are 10 of those, would perhaps give the best
of both worlds.
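
The tiered scheme above can be sketched as a small simulation in plain
Java. It only tracks document counts per index and does not touch Lucene
or Jackrabbit. The class name TieredIndexSim and the constants BASE_DOCS
and FAN_IN are made up for this illustration, and cutting an index at
exactly 100,000 docs is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of the tiered scheme; tracks doc counts only. */
class TieredIndexSim {
    static final int BASE_DOCS = 100_000; // docs at which a tier-0 index is full
    static final int FAN_IN = 10;         // full indexes per tier that trigger a merge

    /** tiers.get(t) = number of full indexes at tier t (tier 1 = 1,000,000 docs each). */
    final List<Integer> tiers = new ArrayList<>();
    int openDocs = 0; // docs in the persistent index that still accepts flushes
    int merges = 0;   // number of FAN_IN-way merges performed so far

    /** Called when a volatile index with the given number of docs is persisted. */
    void flush(int docs) {
        openDocs += docs;
        while (openDocs >= BASE_DOCS) {
            openDocs -= BASE_DOCS; // simplification: cut exactly at BASE_DOCS
            addFull(0);
        }
    }

    /** Registers one full index at a tier, merging upwards when FAN_IN are present. */
    private void addFull(int tier) {
        while (tiers.size() <= tier) {
            tiers.add(0);
        }
        int count = tiers.get(tier) + 1;
        if (count == FAN_IN) {
            tiers.set(tier, 0); // all FAN_IN indexes become one index a tier higher
            merges++;
            addFull(tier + 1);
        } else {
            tiers.set(tier, count);
        }
    }

    /** Total number of indexes a query would have to visit. */
    int indexCount() {
        int n = openDocs > 0 ? 1 : 0;
        for (int c : tiers) {
            n += c;
        }
        return n;
    }
}
```

In this sketch, persisting 2,500 volatile indexes of 1,000 docs each
(2,500,000 docs in total) leaves 7 indexes: two of 1,000,000 docs and
five of 100,000 docs, after 2 merges. Each tier holds at most 9 full
indexes, so the total grows with the logarithm of the document count
rather than linearly.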

WDYT? Do other people experience the same problems? I do not know how
other people use Jackrabbit, but the way I want to use it mainly
consists of searching. Almost everything I do is a search. Building a
website with Jackrabbit as content store results in queries all over the
place, where currently some are IMHO too slow, and some aren't even
possible within reasonable time scales (like: give me the 10 most recent
articles in /content/en/news//*[@type='news'], because this results in a
ChildAxisQuery or DescendantSelfAxisQuery, which AFAICS cannot be done
over millions of documents. To solve this in my setup, I chose to index
the path of a document, though I do realize that moving a node now
becomes expensive because of re-indexing)
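
One possible way to index a path, sketched in plain Java: store every
ancestor path of a node as a term, so that "all descendants of
/content/en/news" becomes a single term lookup. This is not necessarily
how it is done in my setup or in Jackrabbit; the class names PathTerms
and PathIndex are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PathTerms {
    /** Proper ancestors of an absolute path, root first; the root itself has none. */
    static List<String> ancestors(String path) {
        List<String> out = new ArrayList<>();
        if (path.equals("/")) {
            return out;
        }
        out.add("/");
        int slash = path.indexOf('/', 1);
        while (slash > 0) {
            out.add(path.substring(0, slash));
            slash = path.indexOf('/', slash + 1);
        }
        return out;
    }
}

/** Hypothetical in-memory index from ancestor path to descendant node paths. */
class PathIndex {
    private final Map<String, Set<String>> byAncestor = new HashMap<>();

    /** Indexes a node under each of its ancestor paths. */
    void add(String nodePath) {
        for (String a : PathTerms.ancestors(nodePath)) {
            byAncestor.computeIfAbsent(a, k -> new TreeSet<>()).add(nodePath);
        }
    }

    /** All indexed nodes below the given path, found with one map lookup. */
    Set<String> descendants(String ancestorPath) {
        return byAncestor.getOrDefault(ancestorPath, Collections.<String>emptySet());
    }
}
```

The write-side cost is visible here as well: moving a node changes the
ancestor terms of that node and of every node below it, so the whole
subtree has to be re-indexed.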

Hope to hear what you think about it,

Regards Ard

[1] http://jackrabbit.apache.org/doc/arch/operate/query.html#Query
[2] http://www.mail-archive.com/dev@jackrabbit.apache.org/msg06026.html

-- 

Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel  +31 (0)20 5224466
-------------------------------------------------------------
a.schrijvers@hippo.nl / ard@apache.org / http://www.hippo.nl
-------------------------------------------------------------- 
