Subject: improving the scalability in searching
From: "Ard Schrijvers"
To: dev@jackrabbit.apache.org
Date: Wed, 8 Aug 2007 16:16:47 +0200

Hello,

As mentioned in https://issues.apache.org/jira/browse/JCR-1051, I think there might
be some optimization in scalability and performance in some parts of the current Lucene implementation.

For now, I have two major performance/scalability concerns in the current indexing/searching implementation:

1) The XPath implementation for //*[@mytext] (SQL has the same problem)
2) The XPath jcr:like implementation, for example //*[jcr:like(@mytext,'%foo bar qu%')]

Problem 1): //*[@mytext] is transformed into org.apache.jackrabbit.core.query.lucene.MatchAllQuery, which through the MatchAllWeight uses the MatchAllScorer. In this MatchAllScorer there is a calculateDocFilter() that IMO does not scale. Suppose I have 100.000 nodes with a property 'title', and suppose there are no (or few) duplicate titles. Now, suppose I have the XPath /rootnode/articles/2007[@mytitle]. Then the while loop in calculateDocFilter() is executed 100.000 times (see code below), i.e. 100.000 times

terms.term().text().startsWith(FieldNames.createNamedValue(field, ""))
docs.seek(terms)
docFilter.set(docs.doc());

This scales linearly AFAIU, and becomes slow pretty fast. I can add a unit test that shows this, but on my modest machine I already see searches take 400 ms for 100.000 nodes with a cached reader, while it could easily be 0 ms IIULC ("if I understand lucene correctly" :-) ).

Solution 1): IMO, we should index more (derived) data about a document's properties if we want to be able to query fast (I'll return to this in a mail about IndexingConfiguration, to which I think we can add some features that might tackle this). For this specific problem, the solution would be very simple. Beside

/**
 * Name of the field that contains all values of properties that are indexed
 * as is without tokenizing. Terms are prefixed with the property name.
 */
public static final String PROPERTIES = "_:PROPERTIES".intern();

I suggest adding

/**
 * Name of the field that contains the names of all properties that are
 * present for a certain node.
 */
public static final String PROPERTIES_SET = "_:PROPERTIES_SET".intern();

and, when indexing a node, each property name of that node is added to its index (a few lines of code in NodeIndexer).

Then, searching for all nodes that have a certain property is one single docs.seek(terms) plus setting the docFilter. This approach easily scales to millions of documents, with times close to 0 ms. WDOT? Of course, I can implement this in the trunk.

I will do problem (2) in a next mail because my mail is getting a little long.

Regards Ard

---------------------------------------------
TermEnum terms = reader.terms(new Term(FieldNames.PROPERTIES,
        FieldNames.createNamedValue(field, "")));
try {
    TermDocs docs = reader.termDocs();
    try {
        while (terms.term() != null
                && terms.term().field() == FieldNames.PROPERTIES
                && terms.term().text().startsWith(FieldNames.createNamedValue(field, ""))) {
            docs.seek(terms);
            counter++;
            while (docs.next()) {
                docFilter.set(docs.doc());
            }
            terms.next();
        }
    } finally {
        docs.close();
    }
} finally {
    terms.close();
}
---------------------------------------------

--
Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel +31 (0)20 5224466
-------------------------------------------------------------
a.schrijvers@hippo.nl / ard@apache.org / http://www.hippo.nl
--------------------------------------------------------------
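P.S. The difference between the two schemes can be modelled in plain Java without Lucene at all. This is only a sketch of the idea, not Jackrabbit code: the class and variable names (PropertiesSetSketch, propertiesIndex, propertiesSetIndex) are made up, and the '\uFFFF' separator is an assumption standing in for whatever FieldNames.createNamedValue actually produces.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory model of the two indexing strategies (no Lucene needed).
public class PropertiesSetSketch {

    // Mimics a name/value term; the '\uFFFF' separator is an assumption.
    static String namedValue(String property, String value) {
        return property + '\uFFFF' + value;
    }

    /**
     * Indexes n docs that all carry a distinct 'title' value, then runs both
     * query styles. Returns {termsScanned, filtersEqual ? 1 : 0}.
     */
    static int[] compare(int n) {
        // current style: one PROPERTIES term per distinct name/value pair
        TreeMap<String, BitSet> propertiesIndex = new TreeMap<>();
        // suggested style: one PROPERTIES_SET term per property *name*
        Map<String, BitSet> propertiesSetIndex = new HashMap<>();
        for (int doc = 0; doc < n; doc++) {
            propertiesIndex.computeIfAbsent(namedValue("title", "title-" + doc),
                    k -> new BitSet()).set(doc);
            propertiesSetIndex.computeIfAbsent("title", k -> new BitSet()).set(doc);
        }

        // current calculateDocFilter(): enumerate every term with the prefix
        // "title\uFFFF", one seek per distinct value -> O(distinct values)
        String prefix = namedValue("title", "");
        BitSet scanFilter = new BitSet();
        int termsScanned = 0;
        for (Map.Entry<String, BitSet> e : propertiesIndex.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            scanFilter.or(e.getValue());
            termsScanned++;
        }

        // suggested PROPERTIES_SET: a single lookup on the property name
        BitSet setFilter = propertiesSetIndex.get("title");

        return new int[] { termsScanned, scanFilter.equals(setFilter) ? 1 : 0 };
    }

    public static void main(String[] args) {
        int[] r = compare(100_000);
        System.out.println("terms scanned: " + r[0]);          // 100000 with the old scheme
        System.out.println("same doc filter: " + (r[1] == 1)); // both select the same docs
    }
}
```

With all titles distinct, the prefix scan touches one term per node while the property-name lookup always touches exactly one, which is why the latter stays flat as the repository grows.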