From: Jukka Zitting
Date: Wed, 19 Sep 2012 22:39:42 +0200
Subject: Re: On custom index configuration
To: oak-dev@jackrabbit.apache.org

Hi,

On Wed, Sep 19, 2012 at 9:30 PM, Ard Schrijvers wrote:
> I've read the entire thread, and below reply inline to the initial
> proposal of Jukka as I have some doubts in that area:

Great comments, thanks for joining the discussion!

> The only way I could imagine we already gain a lot compared to jr 2.x
> and still have performance is if we have the backing storage contain
> (and maintain like indexing new nodes) the indexes (just like Jukka
> suggests), but repository (jvm) instances load the entire index nodes
> from the repository to local FS. If the repository index is an append
> only binary (for example append only the binary segments as new
> binaries to an index just like Lucene does) then perhaps it could
> perform

That's the idea. All frequently accessed binaries can and should be
kept locally, which should make the index perform pretty well. This
isn't implemented yet (currently the LuceneIndex simply reads all
index binaries to memory...), so there still is no way to benchmark
the idea in practice. But at least from a design perspective I don't
see any major reasons why this solution couldn't perform at least
reasonably close to what Lucene achieves when directly accessing a
local file system.

> And here I think I have my other doubts.
> For example, Lucene needs the
> same analyzers query time as were used indexing time. Now, if I would
> have an English spellchecker for the index at / and a French for the
> index at /data, then, I cannot see how you could ever query both
> indexes in one go. Similarly if the index at / indexes title property
> as String (single token) and the index at /data indexes the title as
> Text (tokenized). How can you now query the title at /

The index at / indexes content from the entire tree, also from within
/data. The fact that there's an extra index at /data wouldn't affect
the index at / in any way. Therefore you can still easily query for
title at / in English and get correct results also from within /data.

> So, I do think it is nice to be able to configure multiple index
> configuration for different parts of the jcr tree, but I doubt about
> supporting nested indexes that are backed by different index
> configuration. Without the nesting, I think it would work.

As mentioned above, the idea is not for the indexes to be nested.

(I previously toyed with the idea of a hierarchical map-reduce-like
mechanism for building an index incrementally across the whole tree,
but that's a different discussion and probably won't be implemented
unless there's some particular use case for something like that.)

> Thus, query for / uses the index for /. Query for /data uses just
> the index for /data, not the one from /

The index selection process is a bit more complicated than that.
Basically for each query we'd look up all the potentially applicable
indexes, and then each index is asked to estimate how efficiently it
could execute a given query, for example
/jcr:root/data//*[@title='foo']. The index at / would notice that it
does keep track of the title property so it can do a property
constraint pretty efficiently, but probably won't be that fast in
evaluating the path constraint.
The index at /data on the other hand could do both constraints
efficiently, so the query engine will pick that one.

On the other hand, if the query was about some other property, like
/jcr:root/data//*[@author='bar'], and that property is only indexed at
/, then that index would likely get selected by the query engine over
the one at /data.

BR,

Jukka Zitting
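
To make the index selection described above a bit more concrete, here
is a rough sketch of the idea in Python. Nothing in it is actual Oak
code: the Query and Index classes, the estimate_cost and select_index
names, and the numeric penalties are all made up for illustration. It
only shows the shape of the mechanism, where every applicable index
reports an estimated cost and the query engine picks the cheapest one.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Query:
    """Hypothetical simplified query: one path and one property constraint."""
    path: str   # subtree the query is restricted to, e.g. "/data"
    prop: str   # property the query constrains, e.g. "title"


@dataclass(frozen=True)
class Index:
    """Hypothetical index configured at `root`, tracking some properties."""
    root: str
    properties: frozenset

    def covers(self, query: Query) -> bool:
        # An index applies if the queried subtree lies within its root.
        return (self.root == "/"
                or query.path == self.root
                or query.path.startswith(self.root + "/"))

    def estimate_cost(self, query: Query) -> float:
        # The penalties below are arbitrary illustrative numbers.
        cost = 1.0
        if query.prop not in self.properties:
            cost += 100.0  # property constraint cannot use the index
        if self.root != query.path:
            cost += 10.0   # path constraint must be filtered separately
        return cost


def select_index(indexes: Iterable[Index], query: Query) -> Optional[Index]:
    """Ask each applicable index for an estimate and pick the cheapest."""
    candidates = [i for i in indexes if i.covers(query)]
    return min(candidates, key=lambda i: i.estimate_cost(query), default=None)


root_index = Index("/", frozenset({"title", "author"}))
data_index = Index("/data", frozenset({"title"}))

# /jcr:root/data//*[@title='foo']: both constraints fit the /data index.
print(select_index([root_index, data_index], Query("/data", "title")).root)
# /jcr:root/data//*[@author='bar']: author is only indexed at /.
print(select_index([root_index, data_index], Query("/data", "author")).root)
```

With these made-up costs the first query selects the index at /data and
the second selects the index at /, matching the two examples above. A
real estimate would of course depend on the index implementation and
on the content, not on fixed constants.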