jackrabbit-dev mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: improving the scalability in searching
Date Fri, 17 Aug 2007 12:30:45 GMT

> Ard Schrijvers wrote:
> > It is crystal clear: When you have old format, you stay in that format, if
> > you start with new index, you get the new format. Clear and implementable
> > IMO. I can give it a try and implement it unless somebody else wants to do
> > it?

> Marcel Reutegger wrote:
> be our guest ;)

I am working on https://issues.apache.org/jira/browse/JCR-1064. Implementing the new _:PROPERTIES_SET
idea is extremely simple, and changing the MatchAllScorer is quite trivial too. I get performance
gains of about a factor of 10, not only for //*[@mytext] but also for

//*[@mytext and @myothertext]
//*[@mytext or @myothertext]

and for quite a few more (every place in LuceneQueryBuilder where MatchAllQuery is used).

But while making these quite trivial changes, I realized that the MatchAllScorer AFAICS becomes
superfluous, and with it the creation of its sometimes expensive filters. For example,

//*[@mytext and @myothertext], when I have 10^6 nodes with the mytext prop, takes ~100 ms (> 1
sec with the old MatchAllScorer).

Not using the MatchAllQuery at all, but just building (once per property, so 2 times)

query = new TermQuery(new Term(FieldNames.PROPERTIES_SET,field)); 

results in about 15 ms when, for example, 10^6 nodes have the prop 'mytext' and 10^2 have 'myothertext'.
This result scales to many more documents. The current implementation takes > 1 sec on
my computer, and the MatchAllQuery is used for many more use cases.
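
To illustrate why the per-property term lookup scales, here is a toy model in plain Java. It is not Lucene or Jackrabbit code: the class PropertiesSetSketch and its methods are made up for this sketch. It assumes each property name acts as a term whose posting list holds the ids of the nodes carrying that property, which is the idea behind the _:PROPERTIES_SET field. An "and" then only has to walk the smaller list (10^2 entries in the example above) instead of visiting every document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy stand-in for an inverted index; none of this is Lucene or Jackrabbit API.
class PropertiesSetSketch {
    // property name -> sorted ids of the nodes that have that property,
    // i.e. what a term in a _:PROPERTIES_SET-style field would point to.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Index one node together with the names of all its properties.
    void addNode(int docId, String... propertyNames) {
        for (String name : propertyNames) {
            postings.computeIfAbsent(name, k -> new TreeSet<>()).add(docId);
        }
    }

    // Models //*[@a and @b]: iterate the smaller posting list and probe the larger one.
    List<Integer> and(String a, String b) {
        TreeSet<Integer> left = postings.getOrDefault(a, new TreeSet<>());
        TreeSet<Integer> right = postings.getOrDefault(b, new TreeSet<>());
        TreeSet<Integer> small = left.size() <= right.size() ? left : right;
        TreeSet<Integer> large = (small == left) ? right : left;
        List<Integer> hits = new ArrayList<>();
        for (int id : small) {
            if (large.contains(id)) {
                hits.add(id);
            }
        }
        return hits;
    }

    // Models //*[@a or @b]: union of both posting lists, in ascending id order.
    List<Integer> or(String a, String b) {
        TreeSet<Integer> union = new TreeSet<>(postings.getOrDefault(a, new TreeSet<>()));
        union.addAll(postings.getOrDefault(b, new TreeSet<>()));
        return new ArrayList<>(union);
    }
}
```

In this model the cost of the "and" depends on the rarer property, which is consistent with the numbers above, although the real Lucene scorers work differently in detail.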

Since IMO this is such a big performance and scalability improvement, I want to discuss backwards
compatibility for older Jackrabbit releases whose index is not suitable for
this new approach. Checking the current index at startup and falling back to the old index style
if no field FieldNames.PROPERTIES_SET is present seems a little "hacky" to implement.
What I would like instead is to let people choose between two index types in the searchindex
configuration, something like:

<param name="index-type" value="old"/> old|new

and have this value default to old for all 1.3.x releases and to new from the 1.4.0 release
on. People can then still use the 1.4.0 version with the old index type. From 1.4.0 we could
also mark "MatchAllQuery", "MatchAllScorer" and "MatchAllWeight" as deprecated AFAICS,
but I might be missing something.
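
A minimal sketch of how such a switch could be read, in plain Java. The parameter name "index-type" and the values old|new come from the proposal above. The class IndexTypeSketch, the enum and the parse method are hypothetical and not part of Jackrabbit. The release default is passed in, so a 1.3.x build would pass OLD and a 1.4.0 build NEW.

```java
import java.util.Locale;

// Hypothetical helper for the proposed "index-type" parameter; not Jackrabbit API.
class IndexTypeSketch {
    enum IndexType { OLD, NEW }

    // Returns the configured index type, or releaseDefault when the parameter is absent or blank.
    static IndexType parse(String configured, IndexType releaseDefault) {
        if (configured == null || configured.trim().isEmpty()) {
            return releaseDefault;
        }
        switch (configured.trim().toLowerCase(Locale.ROOT)) {
            case "old":
                return IndexType.OLD;
            case "new":
                return IndexType.NEW;
            default:
                // Fail loudly rather than silently building an index in the wrong format.
                throw new IllegalArgumentException(
                        "index-type must be 'old' or 'new', got: " + configured);
        }
    }
}
```

Rejecting unknown values is a design choice of this sketch; falling back to the release default would be the more lenient alternative.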

So, WDYT? I would really like to get these changes into the 1.4 version, because for *many* nodes,
speedups of a hundred times or more can be seen for certain queries (some will gain a factor
of 10, some a factor of 2, but all will be faster).

Regards Ard

> regards
>   marcel
