jackrabbit-dev mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: Controlling the Lucene indexing
Date Wed, 10 Jun 2009 09:48:58 GMT

> Feature-downgrading doesn't help me. I have a clear need how nodes must
> be indexed. Since there seems to be no easy way to do this in
> Jackrabbit, so I fall back to my own index.
> And that's ok for me. I just asked (although I looked at the code) to
> not miss something.
> Thanks for all your patience and help!

Sorry for dropping in so late; I have been too occupied lately to even
follow the list. I hope to have more time in the near future. Anyway, my 2
cents about this:

First of all, I think we already did exactly what Bernd wants
(without hooking in your own index and keeping it in sync: I wouldn't go
there). Also, I would favor support for more indexing tuning within
Jackrabbit. I think everybody accustomed to Lucene or Solr is used
to defining *how* fields should be indexed. We have quite some
support for this in Jackrabbit already, see [1], but some tuning is
missing (for example, I would like to be able to indicate that Lucene
should not index some property (summary) as a single term, as I never
want to sort on it or use equals on it...).

As you can find at the bottom of [1], you can configure a *per*
property analyzer. So, if I know I have some comma-separated keywords
property, I could configure the property to be indexed with my
CommaSeperatedKeyWordAnalyzer. So basically, support for it is there.
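
To make the idea concrete, here is a minimal sketch of the splitting
step such an analyzer would perform. It is plain Java, not a Lucene
Analyzer (the Analyzer API differs between Lucene versions), and the
class name, the trimming, and the lower-casing are my assumptions for
illustration only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Hypothetical helper: the token-splitting step that a comma-separated
 * keyword analyzer would apply to a property value. Not a Lucene class.
 */
final class CommaSeparatedKeywords {

    /** Splits on commas, trims whitespace, lower-cases, drops empty entries. */
    static List<String> tokenize(String value) {
        List<String> terms = new ArrayList<>();
        if (value == null) {
            return terms;
        }
        for (String part : value.split(",")) {
            // Lower-casing is an assumption; skip it for case-sensitive keywords.
            String term = part.trim().toLowerCase(Locale.ROOT);
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [lucene, jcr, faceted navigation]
        System.out.println(tokenize("Lucene, JCR ,, Faceted Navigation"));
    }
}
```

Each keyword then becomes one term, so a multi-word keyword such as
"faceted navigation" stays intact instead of being split on whitespace.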

What is currently not easy to achieve is, for example, indexing one
and the same property in multiple ways. If I am indexing a
title property, I might also want to index a short_title (which is not
a JCR property, just an index-only field). Then, if I have 1,000,000 text
documents having a title, I could still sort on short_title,
whereas sorting on the normal title field results in an instant OOM.
(I am actually facing this, and will use some kind of 'short_title'
strategy. Similar considerations apply to date range queries with
different granularities.)
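
A rough sketch of how such a short_title value could be derived from
the title at indexing time. The class name, the maximum length of 20
and the normalization are assumptions for illustration, not what we
actually run:

```java
import java.util.Locale;

/**
 * Hypothetical helper: derives a compact, index-only sort key from a
 * full title so that sorting touches short values only.
 */
final class ShortTitle {

    /** Assumed maximum number of code points kept in the sort key. */
    static final int MAX_LENGTH = 20;

    static String sortKey(String title) {
        if (title == null) {
            return "";
        }
        String normalized = title.trim().toLowerCase(Locale.ROOT);
        int codePoints = normalized.codePointCount(0, normalized.length());
        if (codePoints <= MAX_LENGTH) {
            return normalized;
        }
        // Cut on a code point boundary so a surrogate pair is never split.
        int end = normalized.offsetByCodePoints(0, MAX_LENGTH);
        return normalized.substring(0, end);
    }

    public static void main(String[] args) {
        // prints controlling the luce
        System.out.println(sortKey("  Controlling the Lucene Indexing in Jackrabbit "));
    }
}
```

The trade-off is that titles sharing the same first 20 characters sort
in arbitrary order relative to each other, which is usually acceptable
for a result list.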

Anyway, back to Bernd's question:

We have provided faceted navigation exposed over JCR as virtual structures.
I am planning to do the same for taxonomy navigation, tagging
navigation, Lucene term space navigation (exposing auto-completion
options), similar-nodes navigation, a broken link checker, etc. But
some of them, for example faceted navigation, needed, I think, the very
same thing Bernd wants: control over how properties are indexed.

As Marcel points out, you only need to extend the SearchIndex and
override createDocument. You can then also use your own NodeIndexer
implementation, which indexes all the properties (and can index a single
property in multiple ways) the way you want. The only drawback, which
makes it a little more difficult to write, is that for historical reasons
(it was not possible otherwise with the Lucene version in use when the
Jackrabbit indexing was written) all properties end up in a single Lucene
property field; hence, you need to do some field value prefixing tricks
(not hard, just a little annoying and confusing from time to time :-)).
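
The prefixing trick can be sketched roughly like this. It is plain
Java and the class name and separator character are assumptions for
illustration; the actual encoding Jackrabbit uses depends on the
version, so check FieldNames in your release before relying on it:

```java
/**
 * Hypothetical illustration of storing many properties in one index
 * field: each term is the property name, a separator, then the value.
 */
final class NamedValue {

    /** Assumed separator; must never occur inside a property name. */
    static final char SEPARATOR = '\uFFFF';

    /** Builds the term text stored in the shared field. */
    static String encode(String propertyName, String value) {
        return propertyName + SEPARATOR + value;
    }

    /** Extracts the property name from an encoded term. */
    static String nameOf(String term) {
        return term.substring(0, separatorIndex(term));
    }

    /** Extracts the value from an encoded term. */
    static String valueOf(String term) {
        return term.substring(separatorIndex(term) + 1);
    }

    /** True if the term belongs to exactly this property (not a longer name). */
    static boolean belongsTo(String propertyName, String term) {
        return term.startsWith(propertyName + SEPARATOR);
    }

    private static int separatorIndex(String term) {
        int idx = term.indexOf(SEPARATOR);
        if (idx < 0) {
            throw new IllegalArgumentException("not an encoded term: " + term);
        }
        return idx;
    }

    public static void main(String[] args) {
        String term = encode("title", "Hello World");
        System.out.println(nameOf(term) + " = " + valueOf(term));
    }
}
```

Including the separator in the prefix check is what keeps a query on
"title" from also matching terms of a property named "title2".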

In your repository.xml, in the <SearchIndex> element, change for example:

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex"> to

<SearchIndex class=",,,..query.lucene.MySearchIndex">

Anyway, you can take a look at [2] and [3] for examples, though they
are already somewhat more sophisticated, as we added custom indexing
configurations as well for optimizing indexing (for example a date
granularity config, which I actually still need to add).

Hope this helps

Regards Ard

[1] http://wiki.apache.org/jackrabbit/IndexingConfiguration
[2] http://svn.onehippo.org/repos/hippo/hippo-ecm/trunk/repository/engine/src/main/java/org/hippoecm/repository/query/lucene/ServicingSearchIndex.java
[3] http://svn.onehippo.org/repos/hippo/hippo-ecm/trunk/repository/engine/src/main/java/org/hippoecm/repository/query/lucene/ServicingNodeIndexer.java

>  Bernd
>> regards
>>  marcel
>>> The downside of not being able to do this (controlling Lucene doc
>>> creation) is having another, self-managed index, and (re-)indexing must
>>> be done by hand, using JCR listeners or some other approach.
>>>  Bernd
