jackrabbit-dev mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: Re: improving the scalability in searching
Date Mon, 20 Aug 2007 11:58:11 GMT
> Christoph Kiehl wrote: 
> I'm a bit indifferent about 1) because I think the change is 
> not fundamentally 
> enough to justify a new QueryHandler class. Do you have any 
> other plans with the 
> new QueryHandler implementation? If I were to implement a SQL 
> based QueryHandler 
> solution I would create a new QueryHandler implementation, 
> but not for a small 
> change like that. 

Well, as for other changes, I have some in mind, though I may have the big picture wrong.
I have been looking through the indexing code, and I cannot work out why all properties
are indexed within the same lucene field, '_:PROPERTIES'. AFAICS, it complicates queries.
Do the reasons lie somewhere in 'ChildAxisQuery', 'DerefQuery', 'ParentAxisQuery' or some
other class? (I haven't looked at these classes yet, so I do not know how they work.)

But to me it seems a much more natural fit for a lucene index to use a separate lucene Field
for *every* unique property name. So, indexing a property modificationDate does not result
in a lucene Field:

<_:PROPERTIES> 1:modificationDate?ms27115hc

but in:

<1:modificationDate> ms27115hc

This is IMO a much clearer way to index. I think it makes classes like SharedFieldSortComparator
redundant, because we can use the standard lucene sort. It seems to me that this sort is more
efficient than the current JR one: although I did not investigate it, I know that the longer
the field values you sort on in lucene, the higher the memory consumption. Certainly when
sorting large result sets, a string prefix like '1:modificationDate?' can make a difference
of *many* MBs in memory. OTOH, perhaps the SharedFieldSortComparator takes care of this in JR,
I am not sure.
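
To make the memory argument concrete, here is a rough sketch in plain Java (no lucene or Jackrabbit classes involved). The class and method names are made up for illustration, the '?' separator is simply the character as it shows up in the example above, and the estimate assumes every document has its own distinct value, so it is an upper bound rather than a measurement:

```java
public class FieldLayoutSketch {
    // Assumption: separator between property name and value in the shared
    // field, written here as it is displayed in the message above.
    static final char SEP = '?';

    /** Term text as it would be stored in the single shared field. */
    static String sharedFieldTerm(String propertyName, String value) {
        return propertyName + SEP + value;
    }

    /** Extra characters each shared-field term carries compared with the bare value. */
    static int prefixOverhead(String propertyName) {
        return propertyName.length() + 1; // property name plus separator
    }

    /** Upper bound on extra characters held when sorting docCount distinct values. */
    static long extraCharsWhenSorting(String propertyName, long docCount) {
        return prefixOverhead(propertyName) * docCount;
    }

    public static void main(String[] args) {
        String name = "1:modificationDate";
        String value = "ms27115hc";
        System.out.println("shared field term:  " + sharedFieldTerm(name, value));
        System.out.println("per-property term:  " + value);
        System.out.println("extra chars, 1,000,000 docs: "
                + extraCharsWhenSorting(name, 1_000_000L));
    }
}
```

With a per-property field the sort only ever sees the bare value, so this overhead would not arise in the first place.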

Furthermore, indexing properties in lucene under their own property name makes you more flexible
in implementing new kinds of searches. For example: give me all distinct 'authors' and count
how many articles each author has, i.e. faceted browsing. With the current indexing strategy,
faceted browsing is much harder.
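
As an illustration of the kind of result meant here, a minimal sketch in plain Java. It does not touch a lucene index; it only assumes we can get at the 'author' value of each document, which is what a dedicated field would give us directly. The class name and the sample values are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FacetCountSketch {
    /** Counts documents per distinct value of one property (one list entry per document). */
    static Map<String, Integer> countByValue(List<String> valuePerDocument) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : valuePerDocument) {
            counts.merge(value, 1, Integer::sum); // add 1 to the running count
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical 'author' values of five articles.
        List<String> authors = Arrays.asList("ard", "christoph", "ard", "marcel", "ard");
        System.out.println(countByValue(authors));
    }
}
```

With the shared field, the same count would first have to pick out the terms that start with the right property-name prefix and strip that prefix off again.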

And, as a possible add-on to the indexing configuration class (but I need to know what you
people think about it, and whether it can be jsr 170/283 compliant), I have been thinking
about enriching the index via the indexing configuration with 'virtual properties'. (By the
way, I am not sure what org.apache.jackrabbit.core.virtual does, I haven't looked at it.
Perhaps it coincides with my ideas; somebody else might know.) Suppose I have a property
with a Calendar date. In the frontend I want to be able to search for articles in week X.
I do not want to store week X as a property, because it is an implicit part of the date I
already have. I would like to define in the indexing configuration that myproperty also needs
to be indexed as, for example, myproperty_weeknr (and specify an analyzer that does this for
you), and that I can query on this one. I would do the same with the first letter of each
author, to efficiently query all authors starting with an "a". Could this be implemented
according to the jsr spec, or is this really not compatible?
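
A small sketch of what such derived values could look like, in plain Java. This is not an existing Jackrabbit feature or configuration format; the class and method names are hypothetical, java.time stands in for the Calendar value, and ISO week numbering is an assumption (another week definition could be configured just as well):

```java
import java.time.LocalDate;
import java.time.temporal.WeekFields;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class VirtualPropertySketch {
    /** Derives the extra index entry <propertyName>_weeknr from a date value. */
    static Map<String, String> derivedFields(String propertyName, LocalDate date) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Assumption: ISO-8601 week numbering (weeks start on Monday).
        int week = date.get(WeekFields.ISO.weekOfWeekBasedYear());
        fields.put(propertyName + "_weeknr", String.valueOf(week));
        return fields;
    }

    /** Derives a lower-cased first letter, e.g. for "all authors starting with an a". */
    static String firstLetter(String author) {
        return author.substring(0, 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(derivedFields("myproperty", LocalDate.of(2007, 8, 20)));
        System.out.println(firstLetter("Ard"));
    }
}
```

The stored property stays a plain date; only the index would carry the derived myproperty_weeknr value.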

So, WDYT about indexing properties in separate lucene Fields, and about possibly indexing
more information for one property? My experience with lucene is that indexing tactically
eases querying a lot and gains you a lot of performance. So, if you do agree on these changes,
which I can try to build into Jackrabbit, then I think they might justify a new QueryHandler
class built alongside the old one. WDYT?

Regards Ard

