jackrabbit-oak-dev mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: On custom index configuration
Date Wed, 19 Sep 2012 21:30:01 GMT
On Wed, Sep 19, 2012 at 10:39 PM, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> Hi,
> On Wed, Sep 19, 2012 at 9:30 PM, Ard Schrijvers
> <a.schrijvers@onehippo.com> wrote:
>> I've read the entire thread, and reply inline below to Jukka's initial
>> proposal, as I have some doubts in that area:
> Great comments, thanks for joining the discussion!


>> The only way I can imagine us gaining a lot compared to jr 2.x
>> while still performing well is if the backing storage contains the
>> indexes (and maintains them, e.g. by indexing new nodes), just as Jukka
>> suggests, but repository (JVM) instances load the entire index nodes
>> from the repository to the local FS. If the repository index is an
>> append-only binary (for example, only new binary segments are appended
>> to an index, just like Lucene does), then perhaps it could
>> perform
> That's the idea.

Ah, good to hear :)

> All frequently accessed binaries can and should be
> kept locally, which should make the index perform pretty well. This
> isn't implemented yet (currently the LuceneIndex simply reads all
> index binaries to memory...), so there still is no way to benchmark

Writing it to the local FS instead of memory would then also be an option,
right? Lucene indexes for current jr 2.x tend to get quite large, so
keeping them in memory could consume a lot of heap. Lucene also performs
a bit better with FS indexes than with in-memory indexes, though this
won't be too big an issue (it is due to GC overhead, certainly when
the in-memory index becomes large).

> the idea in practice. But at least from a design perspective I don't
> see any major reasons why this solution couldn't perform at least
> reasonably close to what Lucene achieves when directly accessing a
> local file system.

Yes, as long as you have the Lucene indexes near the computation,
performance should be at least comparable to normal FS Lucene indexes.
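
To illustrate the "keep the index binaries near the computation" idea, here
is a rough sketch in Python. It is not Oak or Lucene code: the function name
sync_segments, the dict standing in for the repository, and the segment file
names are all made up for illustration. It only relies on the assumption
discussed above, namely that index segments are append-only and never change
once written, so a segment that is already on the local FS never has to be
fetched again.

```python
from pathlib import Path
from typing import Dict, List


def sync_segments(repo_segments: Dict[str, bytes], local_dir: Path) -> List[str]:
    """Copy index segments that are missing locally; return the names fetched.

    repo_segments is a hypothetical stand-in for the index binaries stored
    in the repository (segment name -> content). Because segments are
    assumed to be append-only, an existing local copy is always up to date.
    """
    local_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name in sorted(repo_segments):
        target = local_dir / name
        if not target.exists():
            # Only new segments cause any transfer from the repository.
            target.write_bytes(repo_segments[name])
            fetched.append(name)
    return fetched
```

Calling sync_segments again after new segments were added to the repository
would only transfer the new ones, which is what should keep the cost of
refreshing a local copy low.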

>> And here I think I have my other doubts. For example, Lucene needs the
>> same analyzers at query time as were used at indexing time. Now, if I
>> had an English spellchecker for the index at / and a French one for the
>> index at /data, then I cannot see how you could ever query both
>> indexes in one go. Similarly if the index at / indexes the title property
>> as String (single token) and the index at /data indexes the title as
>> Text (tokenized). How can you now query the title at /
> The index at / indexes content from the entire tree, also from within
> /data. The fact that there's an extra index at /data wouldn't affect
> the index at / in any way. Therefore you can still easily query for
> title at / in English and get correct results also from within /data.
>> So, I do think it is nice to be able to configure multiple index
>> configurations for different parts of the jcr tree, but I have doubts
>> about supporting nested indexes that are backed by different index
>> configurations. Without the nesting, I think it would work.
> As mentioned above, the idea is not for the indexes to be nested. (I
> previously toyed with the idea of a hierarchical map-reduce -like
> mechanism for building an index incrementally across the whole tree,
> but that's a different discussion and probably won't be implemented
> unless there's some particular use case for something like that.)
>> Thus, query for / uses the index for /. Query for /data uses just
>> the index for /data, not the one from /
> The index selection process is a bit more complicated than that.
> Basically for each query we'd look up all the potentially applicable
> indexes, and then each index is asked to estimate how efficiently it
> could execute a given query, for example
> /jcr:root/data//*[@title='foo']. The index at / would notice that it
> does keep track of the title property so it can do a property
> constraint pretty efficiently, but probably won't be that fast in
> evaluating the path constraint. The index at /data on the other hand
> could do both constraints efficiently, so the query engine will pick
> that one.
> On the other hand, if the query was about some other property, like
> /jcr:root/data//*[@author='bar'], and that property is only indexed at
> /, then that index would likely get selected by the query engine over
> the one at /data.
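
The selection process described above could be sketched roughly as follows
(Python, not Oak code). The names Query, Index, estimate_cost and
select_index are hypothetical, and the cost weights are arbitrary numbers
chosen only so that the two example queries from the explanation come out
as described; a real cost model would look different.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Query:
    path: str  # subtree the query is restricted to, e.g. "/data"
    prop: str  # property used in the constraint, e.g. "title"


@dataclass(frozen=True)
class Index:
    root: str                   # where the index is configured
    properties: FrozenSet[str]  # properties this index keeps track of

    def covers(self, path: str) -> bool:
        """True if this index contains all content below the given path."""
        if self.root == "/":
            return True
        return path == self.root or path.startswith(self.root + "/")

    def estimate_cost(self, query: Query) -> float:
        """Lower is better; the weights are illustrative assumptions."""
        if not self.covers(query.path):
            return float("inf")  # index cannot answer this query at all
        cost = 1.0
        if query.prop not in self.properties:
            cost += 100.0  # property constraint is not backed by the index
        if self.root != query.path:
            cost += 10.0   # path constraint needs extra filtering
        return cost


def select_index(indexes: List[Index], query: Query) -> Index:
    """Ask every index for an estimate and pick the cheapest one."""
    return min(indexes, key=lambda index: index.estimate_cost(query))


root_index = Index("/", frozenset({"title", "author"}))
data_index = Index("/data", frozenset({"title"}))

# /jcr:root/data//*[@title='foo'] -> the index at /data wins
title_choice = select_index([root_index, data_index], Query("/data", "title"))
# /jcr:root/data//*[@author='bar'] -> the index at / wins
author_choice = select_index([root_index, data_index], Query("/data", "author"))
```

With these assumed weights, title_choice is the index at /data (it handles
both constraints cheaply) and author_choice is the index at / (the only one
that tracks author), matching the two cases described above.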

Thanks for your detailed explanation, Jukka. It is now clearer to me
how you want to manage it. It does seem quite complex to implement,
but with enough perspiration it might work out :))

Regards Ard

> BR,
> Jukka Zitting

