jackrabbit-oak-dev mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: On custom index configuration
Date Wed, 19 Sep 2012 19:30:51 GMT
Hello Jukka et al,

I've read the entire thread, and I reply inline below to Jukka's
initial proposal, as I have some doubts in that area:

On Tue, Sep 18, 2012 at 5:14 PM, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> Hi,
> First of all I think there shouldn't be just one single place in the
> repository where all index configuration should go. It would be nice
> if users and applications could define custom indexes on areas they
> have write access to, and having to grant them access to some shared
> location for that might be troublesome.
> Instead I'd allow custom indexes to be defined by adding something
> like an oak:indexed mixin type and an associated oak:indexes child
> node to any node in the repository. Each child node of that
> oak:indexes node would configure an index for the subtree rooted at
> that oak:indexed node. Index configuration would be stored as normal
> content, and the index content in a hidden :index subtree or elsewhere
> depending on the type of the index.

Having the Lucene indexes inside the repository is of course really
nice: currently (jr 2.x), bringing up a new cluster node means you
first have to index the entire repository to create a *local* FS
Lucene index (or actually indexes). That said, I have not yet heard
of *any* successful Lucene implementation that did not keep the
Lucene indexes near the computation. Thus having the Lucene indexes
in, say, some NoSQL store or database pretty much means it will never
perform, afaiu.

Also, I've talked to Simon Willnauer (Lucene chair) a couple of times
about this kind of attempt. He says Lucene will *never* perform if
the data (indexes) is not near the computation.

So, if we want to store the Lucene indexes in the Oak repository in
binary fields, how will they ever be 'near' the computation?

OTOH, I must be missing something: I expressed these concerns to
Jukka before, so he must know something that I don't if he is still
confident this will work :)

The only way I can imagine gaining a lot compared to jr 2.x while
still performing well is to have the backing storage contain the
indexes (and maintain them, e.g. when indexing new nodes), just as
Jukka suggests, but to have each repository (JVM) instance load the
entire index nodes from the repository to its local FS. If the
repository index is an append-only binary (for example, only
appending the binary segments as new binaries to an index, just like
Lucene does), then perhaps it could


> When executing a query, the search engine in Oak would then detect all
> indexes along the main path axis of a given query. For example, when
> querying for content inside /data/foo, the search engine would use the
> indexes at / and /data, but not the ones at /articles.
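The lookup described above can be sketched in a few lines of plain
Java. This is only an illustration of the path rule, not Oak code:
the class and method names are made up, and index roots are assumed
to be plain absolute paths.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Oak: selects the indexes that lie
// on the path from the root down to the queried path.
public class IndexLookup {

    // True if indexRoot is the query path itself or one of its ancestors.
    static boolean isAncestorOrSelf(String indexRoot, String queryPath) {
        if (indexRoot.equals("/")) {
            return true; // the root covers every absolute path
        }
        return queryPath.equals(indexRoot)
                || queryPath.startsWith(indexRoot + "/");
    }

    // Keeps only the index roots that cover the query path, in input order.
    static List<String> applicableIndexRoots(List<String> indexRoots,
                                             String queryPath) {
        List<String> result = new ArrayList<>();
        for (String root : indexRoots) {
            if (isAncestorOrSelf(root, queryPath)) {
                result.add(root);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> roots = Arrays.asList("/", "/data", "/articles");
        // prints [/, /data]
        System.out.println(applicableIndexRoots(roots, "/data/foo"));
    }
}
```

Note the "/" appended before the prefix check: without it, an index
at /data would wrongly be picked for a query on /database.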

And here I have my other doubts. For example, Lucene needs the same
analyzers at query time as were used at indexing time. Now, if I
have an English spellchecker for the index at / and a French one for
the index at /data, then I cannot see how you could ever query both
indexes in one go. The same applies if the index at / indexes the
title property as String (single token) and the index at /data
indexes the title as Text (tokenized). How can you now query the
title at /?
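To make the String versus Text problem concrete, here is a small
plain-Java sketch. It does not use the Lucene API; the two "analyzers"
are hypothetical stand-ins (one keeps the whole value as a single
token, the other lowercases and splits on whitespace), and the title
value is invented for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for two differently configured indexes.
public class AnalyzerMismatch {

    // "String" style: the whole property value is one token.
    static Set<String> keywordTokens(String value) {
        return Collections.singleton(value);
    }

    // "Text" style: lowercased and split on whitespace.
    static Set<String> textTokens(String value) {
        String[] parts = value.toLowerCase(Locale.ROOT).trim().split("\\s+");
        return new HashSet<>(Arrays.asList(parts));
    }

    // A term query matches only if the exact term was indexed.
    static boolean matches(Set<String> indexedTokens, String queryTerm) {
        return indexedTokens.contains(queryTerm);
    }

    public static void main(String[] args) {
        String title = "Apache Oak";
        // The term "oak" hits the tokenized index only.
        System.out.println(matches(keywordTokens(title), "oak")); // false
        System.out.println(matches(textTokens(title), "oak"));    // true
        // The full value hits the single-token index only.
        System.out.println(matches(keywordTokens(title), "Apache Oak")); // true
        System.out.println(matches(textTokens(title), "Apache Oak"));    // false
    }
}
```

No single query term matches the same title in both indexes, which
is why one query cannot simply be sent to both.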

So, I do think it is nice to be able to configure multiple index
configurations for different parts of the JCR tree, but I have doubts
about supporting nested indexes that are backed by different index
configurations. Without the nesting, I think it would work. Thus, a
query for / uses the index for /. A query for /data uses just the
index for /data, not the one from /.
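A minimal sketch of this non-nested rule, in plain Java rather than
Oak code: each query uses exactly one index, the one whose root is
the closest ancestor of (or equal to) the queried path. The class and
method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Oak: picks a single index per query.
public class NearestIndexLookup {

    // True if indexRoot is the query path itself or one of its ancestors.
    static boolean covers(String indexRoot, String queryPath) {
        if (indexRoot.equals("/")) {
            return true;
        }
        return queryPath.equals(indexRoot)
                || queryPath.startsWith(indexRoot + "/");
    }

    // Returns the deepest covering index root, or null if none covers the path.
    static String nearestIndexRoot(List<String> indexRoots, String queryPath) {
        String best = null;
        for (String root : indexRoots) {
            boolean deeper = best == null || root.length() > best.length();
            if (covers(root, queryPath) && deeper) {
                best = root;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> roots = Arrays.asList("/", "/data", "/articles");
        System.out.println(nearestIndexRoot(roots, "/data/foo")); // /data
        System.out.println(nearestIndexRoot(roots, "/other"));    // /
    }
}
```

Since every covering root is a prefix of the same path, the longest
one is always the closest ancestor, so comparing lengths is enough.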

These are my concerns. Unfortunately I cannot join the upcoming Oak
hackathon due to holiday; otherwise I would have been very
interested in the details I don't understand.

Regards Ard

> Removing a custom index would be a simple matter of removing the
> respective index configuration node. For example, to remove the full
> text index defined above, one would do:
>     Session session  = ...;
>     session.getNode("/data/oak:indexes/fulltext").remove();
>     session.save();
> BR,
> Jukka Zitting
