jackrabbit-users mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: AutoCompelete
Date Thu, 25 Nov 2010 18:34:03 GMT
On Thu, Nov 25, 2010 at 1:46 PM, Alexander Klimetschek
<aklimets@adobe.com> wrote:
> On 25.11.10 12:11, "Ard Schrijvers" <a.schrijvers@onehippo.com> wrote:
>>I would not use all the terms from the term enum, but only the terms
>>belonging to some (multivalued) property. For example
>>Perhaps this is what you actually meant in your first mail I now
>>realize. I only thought you would just fetch all the property values
>>by jcr calls,
> No, I meant a separate index for auto-completion, that would be stored as
> JCR tree. It would be re-generated every now and then (doesn't need to be
> real-time at all, since the "popularity" will in most cases not change
> dramatically very quickly).
> And the second thing is where to build the index from: Lucene term space,
> search statistics, a simple dictionary, etc.
>>where I suggested to expose the Lucene term space (thus
>>probably only for some property the terms) as a hierarchical tree, for
>>  |- p
>>  |   |- e
>>  |   `- s
>>  |-c
>>  `-r
>>If you type, 'ap' the suggestions are 'ape' and 'aps'
> My structure is similar, only that on each node (also on "a" or "a/p")
> there is a multi-value property containing the terms to show. For example:
> a/@terms = [ "alpha", "argon", ...] and a/p/@terms = ["ape", "apple", ...]
> This is because:
> - if popularity is taken into account, the list might be different on "a"
> than on "a/p"
> - calculating the list for "a" would require to iterate down into the leaf
> nodes, hurting performance
>>>Otherwise, how do you select the ~10 items to show for 1, 2 or 3 letter
>>> inputs? You need some use-case-dependent priorities for each term, for
>>> example the popularity of those terms gathered from the search interface
>>> itself.
>>This info is pretty much in the Lucene term space: So, you can
>>retrieve all the terms for some property (tags property). How often
>>the terms are used is also present (popularity)
> Yes. However, this requires that you want the popularity of the terms to be
> defined by how often they appear in your repository. And it requires that
> you can live with the shared index for the entire content, which
> might contain multiple websites, users, etc., probably "messing" the index
> up a bit.
> Using terms only from certain properties (like myproject:tags) can help,
> but I think a dedicated index is still faster. It is essentially just a
> single node-bundle read, and optimized access-based caching comes for free
> then with the item state management in Jackrabbit.
> So if the Lucene index is used as source, I would still generate the index
> as described and later do the auto-completion from there.
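
A minimal in-memory sketch of the dedicated index described above, assuming lowercase terms and a term-to-popularity map as input. Plain Java objects stand in for the JCR nodes, and the names `PrefixIndex`, `Node`, `build` and `suggest` are hypothetical, not Jackrabbit API. In a repository, each `Node` would be a JCR node and `terms` its multi-value property:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for the auto-completion tree (a, a/p, a/p/e, ...). */
class PrefixIndex {
    /** Hypothetical node: one per prefix, holding the precomputed terms to show. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final List<String> terms = new ArrayList<>(); // plays the role of @terms
    }

    private final Node root = new Node();
    private final int maxDepth;

    private PrefixIndex(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    /** Regenerates the whole index; meant to run every now and then, not in real time. */
    static PrefixIndex build(Map<String, Integer> popularity, int maxDepth, int maxTerms) {
        PrefixIndex index = new PrefixIndex(maxDepth);
        for (String term : popularity.keySet()) {
            Node node = index.root;
            // Register the term on every prefix node down to maxDepth.
            for (int i = 0; i < Math.min(term.length(), maxDepth); i++) {
                node = node.children.computeIfAbsent(term.charAt(i), c -> new Node());
                node.terms.add(term);
            }
        }
        // Most popular first, ties broken alphabetically.
        Comparator<String> byPopularity = Comparator
                .comparing((String t) -> popularity.get(t), Comparator.<Integer>reverseOrder())
                .thenComparing(Comparator.<String>naturalOrder());
        index.rank(index.root, byPopularity, maxTerms);
        return index;
    }

    /** Sorts each node's list and keeps only the top maxTerms entries. */
    private void rank(Node node, Comparator<String> order, int maxTerms) {
        node.terms.sort(order);
        if (node.terms.size() > maxTerms) {
            node.terms.subList(maxTerms, node.terms.size()).clear();
        }
        for (Node child : node.children.values()) {
            rank(child, order, maxTerms);
        }
    }

    /** A lookup walks at most maxDepth nodes and reads one precomputed list. */
    List<String> suggest(String prefix) {
        Node node = root;
        for (int i = 0; i < Math.min(prefix.length(), maxDepth); i++) {
            node = node.children.get(prefix.charAt(i));
            if (node == null) {
                return new ArrayList<>();
            }
        }
        // For prefixes longer than maxDepth, filter the deepest node's list.
        // That list is already truncated, so long prefixes may miss rare terms.
        List<String> result = new ArrayList<>();
        for (String term : node.terms) {
            if (term.startsWith(prefix)) {
                result.add(term);
            }
        }
        return result;
    }
}
```

Because every node carries its own ranked list, the list for "a" can differ from the one for "a/p", and answering "a" never requires iterating down to the leaf nodes.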

Thanks a lot for your detailed outline. I do not think performance
will be an issue when using the Lucene term space as I suggested, but I
certainly see the benefits of a separate index (for example, you don't
use stemming for that index).
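
For comparison, a rough sketch of the term-space approach, assuming the terms of a single property are available in sorted order together with how often each one is used. A `TreeMap` stands in for the sorted term dictionary here; this is not the Lucene API, and `TermSpaceSuggester`, `addTerm` and `suggest` are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Prefix suggestions read directly from a sorted term -> frequency map. */
class TermSpaceSuggester {
    // Stand-in for the terms of one property; the value is how often the term is used.
    private final NavigableMap<String, Integer> docFreq = new TreeMap<>();

    void addTerm(String term, int frequency) {
        docFreq.put(term, frequency);
    }

    List<String> suggest(String prefix, int limit) {
        List<Map.Entry<String, Integer>> matches = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order, so start at the
        // prefix and stop at the first term that no longer matches.
        for (Map.Entry<String, Integer> entry : docFreq.tailMap(prefix, true).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) {
                break;
            }
            matches.add(entry);
        }
        // Most frequently used first, ties broken alphabetically.
        matches.sort((a, b) -> {
            int byFrequency = Integer.compare(b.getValue(), a.getValue());
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, matches.size()); i++) {
            result.add(matches.get(i).getKey());
        }
        return result;
    }
}
```

The cost of a lookup here grows with the number of terms that share the prefix, which is the main trade-off against a precomputed list per prefix.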

A completely different question: you also expose them as virtual JCR
tree structures, right? Is this directly possible in Jackrabbit, or
did you add some extension? We have added some extensions to make some
extra things possible (like exposing a virtual tree on demand below a
node of some node type, for example for faceted navigation). I would
still like to see these become part of Jackrabbit. Just wondering how
you achieved it. I know Jackrabbit has some support for virtual layers.

Regards Ard

> Regards,
> Alex
> --
> Alexander Klimetschek
> Developer // Adobe (Day) // Berlin - Basel
