jackrabbit-users mailing list archives

From Alexander Klimetschek <aklim...@adobe.com>
Subject Re: AutoCompelete
Date Thu, 25 Nov 2010 12:46:17 GMT
On 25.11.10 12:11, "Ard Schrijvers" <a.schrijvers@onehippo.com> wrote:
>I would not use all the terms from the term enum, but only the terms
>belonging to some (multivalued) property. For example
>Perhaps this is what you actually meant in your first mail I now
>realize. I only thought you would just fetch all the property values
>by jcr calls,

No, I meant a separate index for auto-completion that would be stored as
a JCR tree. It would be regenerated every now and then (it doesn't need to
be real-time at all, since the "popularity" will in most cases not change
dramatically very quickly).

And the second thing is where to build the index from: Lucene term space,
search statistics, a simple dictionary, etc.

>where I suggested to expose the Lucene term space (thus
>probably only for some property the terms) as a hierarchical tree, for
>  |- p
>  |   |- e
>  |   `- s
>  |-c
>  `-r
>If you type, 'ap' the suggestions are 'ape' and 'aps'

My structure is similar, only that on each node (also on "a" or "a/p")
there is a multi-value property containing the terms to show. For example:
a/@terms = [ "alpha", "argon", ...] and a/p/@terms = ["ape", "apple", ...]
This is because:

- if popularity is taken into account, the list might be different on "a"
than on "a/p"
- calculating the list for "a" would require iterating down into the leaf
nodes, hurting performance
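
To illustrate the layout, here is a minimal sketch in Python that uses a
plain dict as a stand-in for the JCR tree. The names INDEX, node_path and
suggest are made up for this example and are not Jackrabbit API; in a real
repository each key would be a node path and each list its multi-value
"terms" property.

```python
# Hypothetical in-memory stand-in for the JCR tree:
# node path -> values of that node's multi-value "terms" property.
INDEX = {
    "a": ["alpha", "argon"],
    "a/p": ["ape", "apple"],
}


def node_path(prefix, max_depth=3):
    """Map typed input to a node path, e.g. 'ap' -> 'a/p'.

    max_depth is an assumed limit on how deep the tree is generated.
    """
    return "/".join(prefix.lower()[:max_depth])


def suggest(index, prefix):
    """Return the precomputed terms for a prefix.

    This is a single lookup of one node; no descent into child nodes.
    """
    if not prefix:
        return []
    terms = index.get(node_path(prefix), [])
    # Inputs longer than the tree depth land on the deepest node,
    # so filter its list by the full input.
    wanted = prefix.lower()
    return [t for t in terms if t.lower().startswith(wanted)]
```

Typing "a" reads only the list stored on "a", and typing "ap" reads only
the list stored on "a/p", so the two lists can differ freely.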

>>Otherwise, how do you select the ~10 items to show for 1, 2 or 3 letter
>> inputs? You need some use-case-dependent priorities for each term, for
>> example the popularity of those terms gathered from the search interface
>> itself.
>This info is pretty much in the Lucene term space: So, you can
>retrieve all the terms for some property (tags property). How often
>the terms are used is also present (popularity)

Yes. However, this requires that you want the popularity of the terms to
be defined by how often they appear in your repository. And it requires
that you can live with the shared index for the entire content, which
might contain multiple websites, users, etc., probably "messing" the index
up a bit.

Using terms only from certain properties (like myproject:tags) can help,
but I think a dedicated index is still faster. A lookup is essentially
just a single node-bundle read, and access-based caching then comes for
free with the item state management in Jackrabbit.

So if the Lucene index is used as source, I would still generate the index
as described and later do the auto-completion from there.
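
A rough sketch of that generation step, in Python and independent of the
actual source: it assumes the source (Lucene term frequencies, search
statistics, a dictionary) has already been reduced to a mapping of term to
popularity score. build_index, max_depth and top_n are names invented for
this example; writing the result into the repository is left out.

```python
from collections import defaultdict


def build_index(popularity, max_depth=3, top_n=10):
    """Precompute the most popular terms for every prefix node.

    popularity: dict of term -> score (assumed input, e.g. how often a
                term occurs or how often it was searched for).
    Returns a dict of node path ("a", "a/p", ...) -> list of terms,
    most popular first, at most top_n entries per node.
    """
    buckets = defaultdict(list)
    for term, score in popularity.items():
        key = term.lower()
        # Register the term on each prefix node down to max_depth.
        for depth in range(1, min(len(key), max_depth) + 1):
            buckets["/".join(key[:depth])].append((score, term))

    index = {}
    for path, scored in buckets.items():
        # Highest score first; ties are broken alphabetically.
        scored.sort(key=lambda item: (-item[0], item[1]))
        index[path] = [term for _, term in scored[:top_n]]
    return index
```

Because every node keeps its own top list, a term that is popular overall
can show up on "a" while a less popular one only appears further down, on
"a/p" for example. The whole structure can be thrown away and rebuilt on
each run.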


Alexander Klimetschek
Developer // Adobe (Day) // Berlin - Basel
