jackrabbit-dev mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject improving the scalability in searching
Date Wed, 08 Aug 2007 14:16:47 GMT

As mentioned in https://issues.apache.org/jira/browse/JCR-1051, I think there is room to improve
the scalability and performance of some parts of the current Lucene-based implementation.

For now, I have two major performance/scalability concerns about the current indexing/searching
implementation:

1) The XPath implementation for //*[@mytext] (the SQL equivalent has the same problem)
2) The XPath jcr:like implementation, for example: //*[jcr:like(@mytext,'%foo bar qu%')]

Problem 1):

//*[@mytext] is transformed into org.apache.jackrabbit.core.query.lucene.MatchAllQuery,
which, through the MatchAllWeight, uses the MatchAllScorer. In this MatchAllScorer there is a
calculateDocFilter() that IMO does not scale. Suppose I have 100,000 nodes with a property
'title', and suppose there are no (or only a few) duplicate titles.

Now, suppose I have the XPath query /rootnode/articles/2007[@title]. Then the while loop in
calculateDocFilter() is executed 100,000 times (see code below): 100,000 evaluations of

terms.term().text().startsWith(FieldNames.createNamedValue(field, ""))

This scales linearly AFAIU and becomes slow pretty quickly. I can add a unit test that shows
this, but on my modest machine I already see searches over 100,000 nodes take 400 ms with a
cached reader, while they could easily take close to 0 ms, IIULC: "if I understand Lucene
correctly" :-)
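To make the cost concrete, here is a small self-contained sketch (plain Java, no Lucene
involved; modeling the term dictionary as a sorted map from "property:value" terms to
document ids is my simplifying assumption, not Jackrabbit's actual data structure). With
unique titles, the existence check has to visit one term per node:

```java
import java.util.*;

public class PrefixScanDemo {

    // Toy model of Lucene's sorted term dictionary: "property:value" -> doc ids.
    static TreeMap<String, List<Integer>> buildIndex(int numDocs) {
        TreeMap<String, List<Integer>> terms = new TreeMap<>();
        for (int doc = 0; doc < numDocs; doc++) {
            // every title is unique, as in the scenario above
            terms.put("title:title-" + doc, Collections.singletonList(doc));
        }
        return terms;
    }

    // //*[@title]: walk every term carrying the property-name prefix and
    // collect the matching documents, like calculateDocFilter() does.
    static int countTermsScanned(TreeMap<String, List<Integer>> terms,
                                 String prefix, BitSet docFilter) {
        int visited = 0;
        for (Map.Entry<String, List<Integer>> e
                : terms.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            visited++;
            for (int doc : e.getValue()) {
                docFilter.set(doc);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> terms = buildIndex(100_000);
        BitSet docFilter = new BitSet();
        int visited = countTermsScanned(terms, "title:", docFilter);
        // one term inspected per distinct value: linear in the node count
        System.out.println(visited + " terms scanned, "
                + docFilter.cardinality() + " docs matched");
        // prints: 100000 terms scanned, 100000 docs matched
    }
}
```

The scan cost grows with the number of distinct property values, which with near-unique
titles is the number of nodes.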

Solution 1):

IMO, we should index more (derived) data about a document's properties if we want to be able
to query fast (I'll return to this in a mail about IndexingConfiguration, to which I think we
can add some features that tackle this). For this specific problem, the solution would be
very simple:


    /**
     * Name of the field that contains all values of properties that are indexed
     * as is without tokenizing. Terms are prefixed with the property name.
     */
    public static final String PROPERTIES = "_:PROPERTIES".intern();

I suggest adding

    /**
     * Name of the field that contains the names of all properties that are
     * present on a certain node.
     */
    public static final String PROPERTIES_SET = "_:PROPERTIES_SET".intern();

and when indexing a node, adding each property name of that node to its index entry (a few
lines of code in NodeIndexer).

Then, finding all nodes that have a given property takes one single docs.seek(terms) to set
the docFilter. This approach easily scales to millions of documents, with times close to
0 ms. WDOT? Of course, I can implement this in the trunk.
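For contrast, here is a toy model of the PROPERTIES_SET idea (again plain Java, not the
actual NodeIndexer code; all names here are illustrative): each node contributes one term per
property *name* at index time, so the existence query becomes a single exact-term lookup,
independent of how many distinct values the property has:

```java
import java.util.*;

public class PropertiesSetDemo {

    // Toy inverted index: property name -> doc ids.
    // Models the proposed PROPERTIES_SET field.
    static Map<String, BitSet> indexPropertyNames(
            List<Set<String>> nodeProperties) {
        Map<String, BitSet> propertiesSet = new HashMap<>();
        for (int doc = 0; doc < nodeProperties.size(); doc++) {
            // at index time, add one term per property name of the node
            for (String propName : nodeProperties.get(doc)) {
                propertiesSet
                        .computeIfAbsent(propName, k -> new BitSet())
                        .set(doc);
            }
        }
        return propertiesSet;
    }

    public static void main(String[] args) {
        // 100,000 nodes, each with a 'title' property
        List<Set<String>> nodes = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            nodes.add(new HashSet<>(Arrays.asList("title", "jcr:primaryType")));
        }
        Map<String, BitSet> propertiesSet = indexPropertyNames(nodes);

        // //*[@title]: one exact-term lookup, no scan over distinct values
        BitSet docFilter = propertiesSet.get("title");
        System.out.println(docFilter.cardinality() + " docs matched");
        // prints: 100000 docs matched
    }
}
```

The lookup touches one term regardless of corpus size, which is why the query time stays
flat as the repository grows.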

I will cover problem (2) in a next mail, because this mail is getting a little long.

Regards Ard

The calculateDocFilter() loop referred to above:

TermEnum terms = reader.terms(new Term(FieldNames.PROPERTIES,
        FieldNames.createNamedValue(field, "")));
try {
    TermDocs docs = reader.termDocs();
    try {
        while (terms.term() != null
                && terms.term().field() == FieldNames.PROPERTIES
                && terms.term().text().startsWith(
                        FieldNames.createNamedValue(field, ""))) {
            docs.seek(terms);
            while (docs.next()) {
                docFilter.set(docs.doc());
            }
            terms.next();
        }
    } finally {
        docs.close();
    }
} finally {
    terms.close();
}


Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel  +31 (0)20 5224466
a.schrijvers@hippo.nl / ard@apache.org / http://www.hippo.nl
