jackrabbit-oak-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: On custom index configuration
Date Wed, 19 Sep 2012 20:39:42 GMT

On Wed, Sep 19, 2012 at 9:30 PM, Ard Schrijvers
<a.schrijvers@onehippo.com> wrote:
> I've read the entire thread, and below reply inline to the initial
> proposal of Jukka as I have some doubts in that area:

Great comments, thanks for joining the discussion!

> The only way I can imagine us gaining a lot compared to jr 2.x while
> still performing well is if the backing storage contains the indexes
> (and maintains them, e.g. by indexing new nodes), just like Jukka
> suggests, but repository (JVM) instances load the entire index nodes
> from the repository to the local FS. If the repository index is an
> append-only binary (for example, only appending the binary segments as
> new binaries to an index, just like Lucene does) then perhaps it could
> perform

That's the idea. All frequently accessed binaries can and should be
kept locally, which should make the index perform pretty well. This
isn't implemented yet (currently the LuceneIndex simply reads all
index binaries into memory...), so there is still no way to benchmark
the idea in practice. But at least from a design perspective I don't
see any major reason why this solution couldn't perform reasonably
close to what Lucene achieves when directly accessing a local file
system.

> And here I think I have my other doubts. For example, Lucene needs the
> same analyzers at query time as were used at indexing time. Now, if I
> had an English spellchecker for the index at / and a French one for
> the index at /data, then I cannot see how you could ever query both
> indexes in one go. Similarly if the index at / indexes the title
> property as String (single token) and the index at /data indexes the
> title as Text (tokenized). How can you now query the title at /

The index at / indexes content from the entire tree, also from within
/data. The fact that there's an extra index at /data wouldn't affect
the index at / in any way. Therefore you can still easily query for
title at / in English and get correct results also from within /data.

> So, I do think it is nice to be able to configure multiple index
> configurations for different parts of the jcr tree, but I have doubts
> about supporting nested indexes that are backed by different index
> configurations. Without the nesting, I think it would work.

As mentioned above, the idea is not for the indexes to be nested. (I
previously toyed with the idea of a hierarchical map-reduce-like
mechanism for building an index incrementally across the whole tree,
but that's a different discussion and probably won't be implemented
unless there's some particular use case for something like that.)

> Thus, query for / uses the index for /. Query for /data uses just
> the index for /data, not the one from /

The index selection process is a bit more complicated than that.

Basically for each query we'd look up all the potentially applicable
indexes, and then ask each index to estimate how efficiently it
could execute the given query, for example
/jcr:root/data//*[@title='foo']. The index at / would notice that it
does keep track of the title property, so it can evaluate the property
constraint pretty efficiently, but probably won't be that fast in
evaluating the path constraint. The index at /data, on the other hand,
could evaluate both constraints efficiently, so the query engine will
pick that one.

On the other hand, if the query was about some other property, like
/jcr:root/data//*[@author='bar'], and that property is only indexed at
/, then that index would likely get selected by the query engine over
the one at /data.
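
As a rough illustration only (this is not Oak code), the selection
process described above could be sketched along these lines. The class
and method names (IndexSelectionSketch, Index, Query, estimateCost,
select) and the cost numbers are all made up for the example; the only
point is that every applicable index reports an estimate and the
cheapest one wins.

```java
import java.util.List;
import java.util.Set;

public class IndexSelectionSketch {

    /** Hypothetical, simplified query: a path restriction plus one property constraint. */
    public static final class Query {
        final String path;
        final String property;

        public Query(String path, String property) {
            this.path = path;
            this.property = property;
        }
    }

    /** Hypothetical index description: where it is defined and which properties it tracks. */
    public static final class Index {
        final String path;
        final Set<String> properties;

        public Index(String path, Set<String> properties) {
            this.path = path;
            this.properties = properties;
        }

        public String getPath() {
            return path;
        }

        /** An index covers the path it is defined at and everything below it. */
        boolean covers(String queryPath) {
            return path.equals("/")
                    || queryPath.equals(path)
                    || queryPath.startsWith(path + "/");
        }

        /** Made-up cost estimate; lower means the index expects to be more efficient. */
        public double estimateCost(Query q) {
            if (!covers(q.path)) {
                return Double.POSITIVE_INFINITY; // not applicable to this query
            }
            double cost = 1.0;
            if (!properties.contains(q.property)) {
                cost += 1000.0; // property is not indexed here
            }
            if (!path.equals(q.path)) {
                cost += 10.0; // path constraint needs extra filtering
            }
            return cost;
        }
    }

    /** Asks every index for an estimate and returns the cheapest applicable one (or null). */
    public static Index select(List<Index> indexes, Query q) {
        Index best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (Index index : indexes) {
            double cost = index.estimateCost(q);
            if (cost < bestCost) {
                best = index;
                bestCost = cost;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Index root = new Index("/", Set.of("title", "author"));
        Index data = new Index("/data", Set.of("title"));
        List<Index> all = List.of(root, data);

        // /jcr:root/data//*[@title='foo']: both constraints are cheap at /data
        System.out.println(select(all, new Query("/data", "title")).getPath());
        // /jcr:root/data//*[@author='bar']: author is only indexed at /
        System.out.println(select(all, new Query("/data", "author")).getPath());
    }
}
```

With these invented costs the first query goes to the index at /data
and the second one to the index at /, matching the two examples above.
A real estimate would of course depend on the actual index contents.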


Jukka Zitting
