From: "Ard Schrijvers"
Reply-To: dev@jackrabbit.apache.org
Subject: RE: improving the scalability in searching
Date: Fri, 17 Aug 2007 14:30:45 +0200

> Ard Schrijvers wrote:
> > It is crystal clear: When you have old format, you
> > stay in that format, if
> > you start with new index, you get the new format. Clear and implementable
> > IMO. I can give it a try and implement it unless somebody else wants to do
> > it?
> Marcel Reutegger wrote:
> be our guest ;)

I am working on https://issues.apache.org/jira/browse/JCR-1064.
Implementing the new _:PROPERTIES_SET idea is extremely simple, and
changing the MatchAllScorer is quite trivial too. I get performance gains
of a factor of 10, not only for //*[@mytext], but also for

  //*[@mytext and @myothertext]
  //*[@mytext or @myothertext]
  //*[not(@mytext)]
  //*[@mytext!='foo']

and for quite a few more (all parts of LuceneQueryBuilder where
MatchAllQuery is used).

But while adding these quite trivial changes, I realized that the
MatchAllScorer AFAICS becomes superfluous, and with it the sometimes
expensive creation of filters. For example,

  //*[@mytext and @myothertext]

takes ~100 ms when I have 10^6 nodes with the mytext property (>1 sec with
the old MatchAllScorer). Not using the MatchAllQuery at all, but just
(two times)

  query = new TermQuery(new Term(FieldNames.PROPERTIES_SET, field));

results in about 15 ms when, for example, 10^6 nodes have the property
'mytext' and 10^2 have 'myothertext'. This result scales to many more
documents. The current implementation takes > 1 sec on my computer, and
the MatchAllQuery is used for many more use cases.

Since IMO this is such a performance and scalability improvement, I want
to discuss backwards compatibility for older Jackrabbit releases whose
index is not suitable for this new approach. Checking the current index
at startup and then falling back to the old index style if no field
FieldNames.PROPERTIES_SET is present seems a little "hacky" to me to
implement.
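
As a rough illustration of why the plain term lookup is cheap, here is a minimal sketch in plain Java. It is not Jackrabbit or Lucene code: the class PropertiesSetSketch and its methods are hypothetical, and the map only stands in for an index field that stores the names of the properties each node has. The "and" case walks the shorter list of node ids and probes the longer one, so the work depends on the rarer property, not on the total number of nodes.

```java
import java.util.*;

// Hypothetical stand-in for an index field listing each node's property names.
// Not the Jackrabbit implementation; names and structure are assumptions.
class PropertiesSetSketch {
    // property name -> sorted ids of the nodes that have that property
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Index a node by recording each of its property names. */
    void addNode(int nodeId, String... propertyNames) {
        for (String name : propertyNames) {
            postings.computeIfAbsent(name, k -> new TreeSet<>()).add(nodeId);
        }
    }

    /** Roughly //*[@name]: a single lookup, no scan over all nodes. */
    SortedSet<Integer> having(String name) {
        TreeSet<Integer> ids = postings.get(name);
        return ids == null
                ? Collections.<Integer>emptySortedSet()
                : Collections.unmodifiableSortedSet(ids);
    }

    /** Roughly //*[@a and @b]: iterate the shorter id list, probe the longer one. */
    List<Integer> havingBoth(String a, String b) {
        SortedSet<Integer> x = having(a);
        SortedSet<Integer> y = having(b);
        SortedSet<Integer> small = x.size() <= y.size() ? x : y;
        SortedSet<Integer> large = (small == x) ? y : x;
        List<Integer> result = new ArrayList<>();
        for (Integer id : small) {
            if (large.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }
}
```

With a layout like this, a property held by 10^2 nodes bounds the cost of the conjunction even when the other property is held by 10^6 nodes.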
What I would like is to enable people to choose between two index types
within the searchindex configuration, something like:

  old|new

and have this value set to old for all 1.3.x releases and to new from the
1.4.0 release on. People can then still use the 1.4.0 version with the
old index type. From 1.4.0 we could also mark "MatchAllQuery",
"MatchAllScorer" and "MatchAllWeight" as deprecated AFAICS, but I might
be missing something.

So, WDYT? I would really like to push the changes into the 1.4 version,
because for *many* nodes, speedups of more than a hundredfold can be seen
for certain queries (some will gain a factor of 10, some a factor of 2,
but all will be faster).

Regards
Ard

> regards
> marcel
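
The old|new switch proposed above could be resolved along these lines. This is only a sketch under assumptions: the enum IndexFormat, its method names, and the way the release version is passed in are all hypothetical, and no such parameter exists in the searchindex configuration yet.

```java
import java.util.Locale;

// Hypothetical sketch of the proposed old|new index type switch.
enum IndexFormat {
    OLD, NEW;

    /** Proposed default: old for 1.3.x and earlier, new from 1.4.0 on. */
    static IndexFormat defaultFor(int major, int minor) {
        return (major > 1 || (major == 1 && minor >= 4)) ? NEW : OLD;
    }

    /**
     * Resolve the configured value; a missing or empty value falls back to
     * the release default. Unknown values raise IllegalArgumentException.
     */
    static IndexFormat resolve(String configured, int major, int minor) {
        if (configured == null || configured.trim().isEmpty()) {
            return defaultFor(major, minor);
        }
        return valueOf(configured.trim().toUpperCase(Locale.ROOT));
    }
}
```

An explicit value always wins over the default, so IndexFormat.resolve("old", 1, 4) keeps an existing index in the old format on a 1.4 release.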