Subject: improving the scalability in searching
From: "Ard Schrijvers"
To: dev@jackrabbit.apache.org
Date: Wed, 8 Aug 2007 16:16:47 +0200

Hello,

As mentioned in https://issues.apache.org/jira/browse/JCR-1051, I think there might
be some optimization in scalability and performance in some parts of the current Lucene implementation.

For now, I have two major performance/scalability concerns in the current indexing/searching implementation:

1) The XPath implementation for //*[@mytext] (SQL has the same problem)
2) The XPath jcr:like implementation, for example //*[jcr:like(@mytext,'%foo bar qu%')]

Problem 1): //*[@mytext] is transformed into org.apache.jackrabbit.core.query.lucene.MatchAllQuery, which through the MatchAllWeight uses the MatchAllScorer. In this MatchAllScorer there is a calculateDocFilter() that IMO does not scale. Suppose I have 100.000 nodes with a property 'title', and suppose there are no (or few) duplicate titles. Now, suppose I have the XPath /rootnode/articles/2007[@mytitle]. Then the while loop in calculateDocFilter() is executed 100.000 times (see code below), i.e. 100.000 times

terms.term().text().startsWith(FieldNames.createNamedValue(field, ""))
docs.seek(terms)
docFilter.set(docs.doc());

This scales linearly AFAIU, and becomes slow pretty fast. I can add a unit test that shows this, but on my modest machine I already see searches take 400 ms for 100.000 nodes with a cached reader, while it could easily be 0 ms IIULC ("if I understand lucene correctly" :-) ).

Solution 1): IMO, we should index more (derived) data about a document's properties if we want to be able to query fast (I'll return to this in a mail about IndexingConfiguration, to which I think we can add some features that might tackle this). For this specific problem, the solution would be very simple. Beside

/**
 * Name of the field that contains all values of properties that are indexed
 * as is without tokenizing. Terms are prefixed with the property name.
 */
public static final String PROPERTIES = "_:PROPERTIES".intern();

I suggest adding

/**
 * Name of the field that contains the names of all properties that are
 * present for a certain node.
 */
public static final String PROPERTIES_SET = "_:PROPERTIES_SET".intern();

and, when indexing a node, each property name of that node is added to its index (a few lines of code in NodeIndexer).

Then, searching for all nodes that have a certain property is one single docs.seek(terms) plus setting the docFilter. This approach easily scales to millions of documents, with times close to 0 ms. WDOT? Of course, I can implement this in the trunk.

I will do problem (2) in a next mail because my mail is getting a little long.

Regards Ard

---------------------------------------------
TermEnum terms = reader.terms(new Term(FieldNames.PROPERTIES,
        FieldNames.createNamedValue(field, "")));
try {
    TermDocs docs = reader.termDocs();
    try {
        while (terms.term() != null
                && terms.term().field() == FieldNames.PROPERTIES
                && terms.term().text().startsWith(FieldNames.createNamedValue(field, ""))) {
            docs.seek(terms);
            counter++;
            while (docs.next()) {
                docFilter.set(docs.doc());
            }
            terms.next();
        }
    } finally {
        docs.close();
    }
} finally {
    terms.close();
}
---------------------------------------------

--
Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel +31 (0)20 5224466
-------------------------------------------------------------
a.schrijvers@hippo.nl / ard@apache.org / http://www.hippo.nl
--------------------------------------------------------------
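P.S. The difference between the two schemes can be modelled in plain Java without Lucene at all. This is only a sketch of the idea, not Jackrabbit code: the class and variable names (PropertiesSetSketch, propertiesIndex, propertiesSetIndex) are made up, and the '\uFFFF' separator is an assumption standing in for whatever FieldNames.createNamedValue actually produces.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory model of the two indexing strategies (no Lucene needed).
public class PropertiesSetSketch {

    // Mimics a name/value term; the '\uFFFF' separator is an assumption.
    static String namedValue(String property, String value) {
        return property + '\uFFFF' + value;
    }

    /**
     * Indexes n docs that all carry a distinct 'title' value, then runs both
     * query styles. Returns {termsScanned, filtersEqual ? 1 : 0}.
     */
    static int[] compare(int n) {
        // current style: one PROPERTIES term per distinct name/value pair
        TreeMap<String, BitSet> propertiesIndex = new TreeMap<>();
        // suggested style: one PROPERTIES_SET term per property *name*
        Map<String, BitSet> propertiesSetIndex = new HashMap<>();
        for (int doc = 0; doc < n; doc++) {
            propertiesIndex.computeIfAbsent(namedValue("title", "title-" + doc),
                    k -> new BitSet()).set(doc);
            propertiesSetIndex.computeIfAbsent("title", k -> new BitSet()).set(doc);
        }

        // current calculateDocFilter(): enumerate every term with the prefix
        // "title\uFFFF", one seek per distinct value -> O(distinct values)
        String prefix = namedValue("title", "");
        BitSet scanFilter = new BitSet();
        int termsScanned = 0;
        for (Map.Entry<String, BitSet> e : propertiesIndex.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            scanFilter.or(e.getValue());
            termsScanned++;
        }

        // suggested PROPERTIES_SET: a single lookup on the property name
        BitSet setFilter = propertiesSetIndex.get("title");

        return new int[] { termsScanned, scanFilter.equals(setFilter) ? 1 : 0 };
    }

    public static void main(String[] args) {
        int[] r = compare(100_000);
        System.out.println("terms scanned: " + r[0]);          // 100000 with the old scheme
        System.out.println("same doc filter: " + (r[1] == 1)); // both select the same docs
    }
}
```

With all titles distinct, the prefix scan touches one term per node while the property-name lookup always touches exactly one, which is why the latter stays flat as the repository grows.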