jackrabbit-dev mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: improving the scalability in searching
Date Tue, 14 Aug 2007 09:32:45 GMT

> I agree with you that the current implementation is not optimized for queries
> that check the existence of a property. Your proposed solution seems reasonable,
> I would implement it the same way. There's just one minor obstacle, how do we
> implement this change in a backward compatible way? An existing index without
> this additional field should still work.

Apart from a possible solution: is it policy that upgrading to the latest Jackrabbit version
must always be possible without re-indexing? Is it not an option to issue some kind of warning
that re-indexing is needed when moving to version x?

My experience with other repositories (Slide) and a custom Lucene indexing layer on top of
them handling all searches is that, for efficient querying, I quite frequently had to change
some indexing settings, which implied re-indexing the entire repository. IMO, when you need a
performant search implementation, you need to be able to tune which parts you index and how,
and you need to be able to query on the result.

I think it should be possible to index a single property in several customizable ways. Might
this be an option for the indexingConfiguration? For example: each article (node) has an
author property, and I have 10,000,000 nodes. Now I want to see the number of documents for
each author whose name starts with an "S". The only way to query this efficiently, AFAICS, is
to query an indexed field that holds the starting letter of the author. The indexing
configuration could state that the author name should also be indexed in a separate field,
for example with a configured analyzer that uses the EdgeNGramTokenizer from Lucene to index
the first letter only. For example, something like:

    <property name="author">
      <copyField dest="author-starting-letter" analyzer="mypackage.FirstLetterAnalyzer"/>
    </property>
    <property name="publishdate">
      <copyField dest="publishdate-weeknumber" analyzer="mypackage.DateWeeknumberAnalyzer"/>
    </property>

where, for example, publishdate-weeknumber holds the week number of a date (useful if you need
fast searching for all articles published in week X, but the week number is not a property of
the document).
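
A minimal sketch of the derived-field idea, in plain Java rather than as a Lucene analyzer
(so no EdgeNGramTokenizer here). The class and method names are hypothetical and not part of
Jackrabbit or Lucene; the sketch only shows which values the two extra fields would hold, and
how a per-author count for one starting letter becomes a single lookup once the letter is
computed at index time.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not Jackrabbit or Lucene API.
final class DerivedFieldSketch {

    // Value for the assumed "author-starting-letter" field: first character, lower-cased.
    static String startingLetter(String author) {
        String trimmed = author.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        int first = trimmed.codePointAt(0);
        return new String(Character.toChars(first)).toLowerCase(Locale.ROOT);
    }

    // Value for the assumed "publishdate-weeknumber" field, using ISO-8601 week numbering.
    static int weekNumber(LocalDate publishDate) {
        return publishDate.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
    }

    // Index time: group authors under their starting letter and count documents per author.
    static Map<String, Map<String, Integer>> buildLetterIndex(List<String> authors) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (String author : authors) {
            index.computeIfAbsent(startingLetter(author), key -> new TreeMap<>())
                 .merge(author, 1, Integer::sum);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> authors = List.of("Schrijvers", "Reutegger", "Smith", "Schrijvers");
        Map<String, Map<String, Integer>> index = buildLetterIndex(authors);

        // Query time: one lookup by letter instead of scanning every author value.
        Map<String, Integer> startingWithS = index.getOrDefault("s", Map.of());
        System.out.println(startingWithS);
        System.out.println(weekNumber(LocalDate.of(2007, 8, 14)));
    }
}
```

The trade-off is the one described above: the extra fields cost index space and a re-index
whenever their definition changes, in exchange for cheap lookups at query time.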

But this would obviously complicate the indexing configuration quite a bit, and you might need
to query on "virtual" properties not defined in .cnd files, which probably is not possible
(though I do not yet know enough about that part... is this possible with org.apache.jackrabbit.core.virtual?)

Bottom line: before thinking about the best way to improve queries that check for the
existence of properties, is a new Jackrabbit release allowed to force people to re-index? IMO,
it is quite a limitation if this is never allowed. (AFAIK, a Lucene index might also become
corrupted, or get out of sync in clustered environments, which makes a re-index necessary anyway.)
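
One way the backward-compatibility question could be handled, sketched in plain Java under
assumptions: the index records a format version and the set of fields it contains. None of
these names (IndexInfo, "properties-set", the version numbers) exist in Jackrabbit; they are
made up for illustration. An old index keeps working through the slower path, and the user
gets a warning that re-indexing would enable the faster one.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; none of these types or names are Jackrabbit API.
final class IndexCompatibilitySketch {

    enum Strategy { FAST_FIELD_LOOKUP, LEGACY_SCAN }

    // Assumed metadata stored alongside an index.
    static final class IndexInfo {
        final int formatVersion;
        final Set<String> fields;

        IndexInfo(int formatVersion, Set<String> fields) {
            this.formatVersion = formatVersion;
            this.fields = fields;
        }
    }

    static final int CURRENT_FORMAT_VERSION = 2;            // assumed value
    static final String EXISTENCE_FIELD = "properties-set"; // assumed field name

    // Use the additional field when the index has it, otherwise fall back to the old path.
    static Strategy chooseStrategy(IndexInfo info) {
        return info.fields.contains(EXISTENCE_FIELD)
                ? Strategy.FAST_FIELD_LOOKUP
                : Strategy.LEGACY_SCAN;
    }

    // Warn, but do not fail, when the index predates the current format.
    static Optional<String> reindexWarning(IndexInfo info) {
        if (info.formatVersion < CURRENT_FORMAT_VERSION) {
            return Optional.of("Index format " + info.formatVersion
                    + " is older than " + CURRENT_FORMAT_VERSION
                    + "; re-index to enable faster property-existence queries.");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        IndexInfo oldIndex = new IndexInfo(1, Set.of("author"));
        System.out.println(chooseStrategy(oldIndex));
        reindexWarning(oldIndex).ifPresent(System.out::println);
    }
}
```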

Regards Ard

> regards
>   marcel
