jackrabbit-oak-dev mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: Re (OAK-36) Implement a query parser - what about indexing?
Date Fri, 23 Mar 2012 09:40:55 GMT
On Thu, Mar 22, 2012 at 9:36 AM, Thomas Mueller <mueller@adobe.com> wrote:
> Hi,
>>OAK-36 covers the Query implementation effort, but I'm wondering if now
>>would be a good time to mention indexing as well.
>>We want to have dedicated indexes, I think that would be accomplished via
>>Any ideas about the availability of this feature?
> Sure. One such a mechanism is implemented, and currently lives under
> org.apache.jackrabbit.mk.index. It is not yet "wired" to
> org.apache.jackrabbit.oak.query.index. This mechanism stores the index
> data in nodes and properties, as a tree (using just the MicroKernel API).
> This mechanism is supposed to be as scalable as the MicroKernel
> implementation (support concurrent writes if the MicroKernel
> implementation supports it).
>>The current index implementation just traverses the existing nodes (albeit
>>applying some path constraints first),
> Yes, that's org.apache.jackrabbit.oak.query.index.TraversingReader
>>This helps with testing the query parser & friends, but a lucene based
>>query engine needs events to update its data.
> Given the scalability requirements defined at [1] (specially concurrent,
> scalable writes in multiple cluster nodes) we plan to support other
> (non-Lucene) index mechanisms as well. Personally, I believe we should use
> Lucene for fulltext indexing, because that's what Lucene is meant for. But
> I'm not sure how a fully scalable fulltext index using Lucene would look
> like. That's still an open question we need to resolve, or define the
> limitations in this area.

I'd opt for not implementing a fulltext search index in the repository
at all, but rather for providing good places to hook in an 'external'
index. I should have written up my/our (Hippo) use cases in a mail
earlier, but never got to it. I've come to believe that free text
search / full text indexing is too domain specific to be captured in a
generic one-size-fits-all solution. Imo, full text indexing is very much
related to how your 'domain model' is mapped to jcr nodes. A generic
repository full text index will index jcr nodes, while, for example at
Hippo, we are interested in indexing 'documents': a document can be
a small bonsai tree of nodes. I know attempts have been made at
indexing_configuration kind of tuning, but, imho, it just does not
work that well.
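
To make the 'document' idea concrete, here is a minimal sketch in plain
Java of flattening a small tree of nodes into one set of index fields.
SimpleNode and aggregate are hypothetical names made up for illustration;
they are not JCR, Jackrabbit or Oak API, and a real implementation would
read from javax.jcr.Node instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DocumentAggregator {

    /** Hypothetical stand-in for a JCR node: a name, string properties and children. */
    static final class SimpleNode {
        final String name;
        final Map<String, String> properties = new LinkedHashMap<>();
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name) {
            this.name = name;
        }

        SimpleNode property(String key, String value) {
            properties.put(key, value);
            return this;
        }

        SimpleNode child(SimpleNode node) {
            children.add(node);
            return this;
        }
    }

    /**
     * Flattens the subtree below root into one field map, keyed by the
     * property path relative to root. Same-name siblings would overwrite
     * each other here; this sketch does not handle them.
     */
    static Map<String, String> aggregate(SimpleNode root) {
        Map<String, String> fields = new LinkedHashMap<>();
        collect(root, "", fields);
        return fields;
    }

    private static void collect(SimpleNode node, String prefix, Map<String, String> fields) {
        for (Map.Entry<String, String> e : node.properties.entrySet()) {
            fields.put(prefix + e.getKey(), e.getValue());
        }
        for (SimpleNode child : node.children) {
            collect(child, prefix + child.name + "/", fields);
        }
    }

    public static void main(String[] args) {
        // One 'document' made of three nodes becomes one index entry.
        SimpleNode doc = new SimpleNode("article")
            .property("title", "Oak indexing")
            .child(new SimpleNode("body").property("text", "Full text goes here"))
            .child(new SimpleNode("author").property("name", "Ard"));
        System.out.println(aggregate(doc));
    }
}
```

The point of the sketch is only that the unit handed to the index is the
whole subtree, so the index holds one entry per document rather than one
per jcr node.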

Also, the jr indexes are quite inefficient in general: in our case,
for just a couple of hundred thousand documents, the number of
jcr nodes easily runs into many millions, so the (Lucene / full text)
indexes are much bigger than needed. In the current jr 2 indexes,
pretty much every string property also gets stored in the index, to
support 'equals' checks. If, for oak, the equality checks are done
against a different index (a node index) instead of Lucene, it will
be very hard to combine the results.

Although I am on thin ice here, I think there are hardly any NoSQL
stores out there that actually include full text indexes. I think we
shouldn't try to address it in the repository, but rather provide some
tooling to easily set up an (external) full text index (like plain
Lucene, or Solr/Elasticsearch) according to someone's exact needs
(like which analyzer to use for which part of the content, which
properties should be stored, which properties should be analyzed in
which ways, which properties are meant for TrieRanges, etc.)
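
As a rough sketch of what such a hook plus per-property configuration
could look like, in plain Java: every name here (ExternalIndexer,
FieldConfig, RecordingIndexer, the analyzer names) is an assumption made
up for illustration, not an existing Oak, Jackrabbit, Lucene or Solr API.
The toy indexer only records which fields it would forward to an
external index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExternalIndexHook {

    /** Hypothetical per-property settings an application chooses for its own index. */
    static final class FieldConfig {
        final String analyzer;  // application-chosen analyzer name, e.g. "standard"
        final boolean stored;   // keep the original value in the index
        final boolean analyzed; // tokenize the value for full text search

        FieldConfig(String analyzer, boolean stored, boolean analyzed) {
            this.analyzer = analyzer;
            this.stored = stored;
            this.analyzed = analyzed;
        }
    }

    /** Hypothetical callback the repository would invoke after content changes. */
    interface ExternalIndexer {
        void documentChanged(String path, Map<String, String> properties);

        void documentRemoved(String path);
    }

    /** Toy indexer: records, per path, the configured fields it would index. */
    static final class RecordingIndexer implements ExternalIndexer {
        private final Map<String, FieldConfig> config;
        final Map<String, List<String>> indexed = new LinkedHashMap<>();

        RecordingIndexer(Map<String, FieldConfig> config) {
            this.config = config;
        }

        @Override
        public void documentChanged(String path, Map<String, String> properties) {
            List<String> fields = new ArrayList<>();
            for (String name : properties.keySet()) {
                FieldConfig fc = config.get(name);
                if (fc != null) { // properties without configuration are skipped
                    fields.add(name + ":" + fc.analyzer);
                }
            }
            indexed.put(path, fields);
        }

        @Override
        public void documentRemoved(String path) {
            indexed.remove(path);
        }
    }

    public static void main(String[] args) {
        Map<String, FieldConfig> config = new LinkedHashMap<>();
        config.put("title", new FieldConfig("standard", true, true));
        config.put("text", new FieldConfig("dutch", false, true));

        RecordingIndexer indexer = new RecordingIndexer(config);
        Map<String, String> props = new LinkedHashMap<>();
        props.put("title", "Oak indexing");
        props.put("text", "Full text goes here");
        props.put("internalId", "42"); // not configured, so not indexed
        indexer.documentChanged("/content/doc1", props);
        System.out.println(indexer.indexed);
    }
}
```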

Regards Ard

> [1]: http://wiki.apache.org/jackrabbit/Goals%20and%20non%20goals%20for%20Jackrabbit%203
> Regards,
> Thomas
