jackrabbit-users mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: AutoCompelete
Date Thu, 25 Nov 2010 10:00:30 GMT
On Thu, Nov 25, 2010 at 9:48 AM, Alexander Klimetschek
<aklimets@adobe.com> wrote:
> On 24.11.10 22:29, "Ard Schrijvers" <a.schrijvers@onehippo.com> wrote:
>>On Wed, Nov 24, 2010 at 10:03 PM, Zhou Wu <zwu_ca@yahoo.com> wrote:
>>> I'm trying to do some thing like
>>> org.apache.jackrabbit.core.query.lucene.spell.SpellChecker for
>>> When user type in the search input box, a list of words (phrases) that
>>> up like Google suggestion.  I searched on the web and got
>>> that looks like helpful. But I don't know how to start to get it work
>>> Jackrabbit. Could any one give some tips? Thanks,
>>Afaiu, Spellchecker wouldn't fit auto completion. Auto completion is
>>about suggesting existing terms in the index after you typed, say
> Exactly, spellcheck is about getting from "jeck" to "jack", but
> autocompletion (in its hardest form) is about getting from typing a "j"
> to a list like "jack, jupiter, jelly, january, ...".
> Also there are different use cases as what to show in auto-completion
> (always showing all possibilities doesn't work ;-)) and it is language-
> and region-dependent.
> Since those few-letter inputs like "j" will be the most frequent ones, as
> people are typing words one-by-one, you want to look up those
> terms from a pre-built index as directly as possible. For this, you can
> have something like "j/ja/jac" in the repository. On each level there is a
> multi-value property containing the auto-completions/suggestions you want
> to show (10 is a good number, for example; it is what Google uses).

Ah, you suggest keeping track of the 'auto-suggest' list manually,
right? Just read them all in once, have some observer for changes, et
voilà. That works, but I wanted to build it differently myself.
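
A minimal sketch of the precomputed layout discussed above, in plain Java
with a map standing in for the repository. Each key is a node path such as
"j/ja/jac" and each value plays the role of the multi-value suggestions
property. The class name PrefixSuggestIndex, its methods, and the depth
limit of three letters are assumptions made for illustration; none of this
is Jackrabbit API. An observer reacting to content changes would call
something like add().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical in-memory stand-in for repository nodes such as "j/ja/jac". */
final class PrefixSuggestIndex {
    private static final int MAX_SUGGESTIONS = 10;  // list size mentioned in the thread
    private static final int MAX_PREFIX_LENGTH = 3; // assumption: depth of "j/ja/jac"

    // Key: node path like "j/ja/jac"; value: the multi-value suggestions property.
    private final Map<String, List<String>> nodes = new HashMap<>();

    /** Turns a prefix such as "jac" into the node path "j/ja/jac". */
    static String pathFor(String prefix) {
        StringBuilder path = new StringBuilder();
        for (int i = 1; i <= prefix.length(); i++) {
            if (i > 1) {
                path.append('/');
            }
            path.append(prefix, 0, i);
        }
        return path.toString();
    }

    /** Registers a term under each of its prefixes; call in the desired ranking order. */
    void add(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        int depth = Math.min(t.length(), MAX_PREFIX_LENGTH);
        for (int i = 1; i <= depth; i++) {
            List<String> list =
                nodes.computeIfAbsent(pathFor(t.substring(0, i)), k -> new ArrayList<>());
            if (list.size() < MAX_SUGGESTIONS && !list.contains(t)) {
                list.add(t);
            }
        }
    }

    /** Looks up the stored list for what was typed; one map access, no search. */
    List<String> suggest(String typed) {
        String t = typed.toLowerCase(Locale.ROOT);
        if (t.isEmpty()) {
            return Collections.emptyList();
        }
        String key = t.substring(0, Math.min(t.length(), MAX_PREFIX_LENGTH));
        List<String> stored = nodes.getOrDefault(pathFor(key), Collections.emptyList());
        // Input longer than the stored depth only narrows the capped list.
        List<String> result = new ArrayList<>();
        for (String s : stored) {
            if (s.startsWith(t)) {
                result.add(s);
            }
        }
        return result;
    }
}
```

After adding "jack", "jupiter", "jelly" and "january" in that order,
suggest("j") returns all four in insertion order and suggest("ja") returns
"jack" and "january". Input longer than three letters is only filtered
against the capped list of its three-letter prefix, which is a limitation
of this sketch rather than of the approach.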

I want to deliver the feature for us in a different way: expose the
Lucene term enum as a virtual hierarchical node tree, where every node
is a single letter. This is very efficient, and easy to build once
virtual layers are up and running. The only thing I am still struggling
with is Lucene stemming: the term enum then contains stemmed
words. OTOH, imo, the complete stemming concept in Lucene has been
broken from the start, and I never advise stemming. Removing diacritics is
enough. (Lucene 4.0 won't need stemming any more, as you can do
everything with fuzzy searches because of a new bleeding-edge
automaton query... Jackrabbit has to be upgraded first, however :-))
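
A rough sketch of the letter-per-node idea, using only the Java standard
library. A plain list of terms stands in for the Lucene term enum, and the
TermTree class with its add(), complete() and fold() methods is a
hypothetical name chosen for illustration, not part of Lucene or
Jackrabbit. fold() shows diacritic removal with java.text.Normalizer in
place of stemming. If the indexed terms were stemmed, the tree would
return the stems.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical tree in which every node below the root is a single letter. */
final class TermTree {
    // TreeMap keeps children sorted, so completions come out alphabetically.
    private final Map<Character, TermTree> children = new TreeMap<>();
    private boolean isTerm;

    /** Lower-cases and strips combining marks, so accented letters lose their accents. */
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    /** Adds one term, creating one node per letter along the way. */
    void add(String term) {
        TermTree node = this;
        for (char c : fold(term).toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TermTree());
        }
        node.isTerm = true;
    }

    /** Walks down to the node for the typed prefix and lists up to limit terms below it. */
    List<String> complete(String typed, int limit) {
        String prefix = fold(typed);
        TermTree node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return new ArrayList<>();
            }
        }
        List<String> out = new ArrayList<>();
        node.collect(new StringBuilder(prefix), limit, out);
        return out;
    }

    // Depth-first walk that rebuilds each term from the letters on the path.
    private void collect(StringBuilder path, int limit, List<String> out) {
        if (out.size() >= limit) {
            return;
        }
        if (isTerm) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, TermTree> e : children.entrySet()) {
            if (out.size() >= limit) {
                return;
            }
            path.append(e.getKey().charValue());
            e.getValue().collect(path, limit, out);
            path.setLength(path.length() - 1);
        }
    }
}
```

With the terms "jack", "january", "jelly" and "jupiter" added,
complete("ja", 10) walks two nodes down and returns "jack" and "january".
The tree here is built eagerly in memory; a virtual layer would presumably
compute each level on demand from the term enum instead.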

Regards Ard

> How this index is built in the first place depends on the use case. For
> example, the Google search shows you terms that are currently popular, so
> they probably update that index based on query statistics once or
> twice a day. To start, you can use a dictionary, filter out stop words
> like "the", "and" etc. and build that index automatically. Then you only
> get single words - Google also shows full searches, like "jack wolfskin".
> And there are probably many other sources you can build such an index from.
> Hope that helps,
> Alex
> --
> Alexander Klimetschek
> Developer // Adobe (Day) // Berlin - Basel
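
One possible way to derive the suggestion lists from query statistics, as
a sketch under stated assumptions: the queries are available as plain
strings, the stop-word list is a tiny illustrative one, and the class
SuggestionSource with its countWords() and topWords() methods is invented
for this example. It only produces single words; full searches such as
"jack wolfskin" would need the whole query counted as one unit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper that ranks candidate suggestions by how often they were searched. */
final class SuggestionSource {
    // Assumption: illustrative stop words only; a real list is language-specific.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "and", "a", "of"));

    /** Counts how often each non-stop word occurs in the given queries. */
    static Map<String, Integer> countWords(List<String> queries) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String query : queries) {
            for (String word : query.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (word.isEmpty() || STOP_WORDS.contains(word)) {
                    continue;
                }
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns up to limit words, most frequent first; ties are broken alphabetically. */
    static List<String> topWords(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (out.size() >= limit) {
                break;
            }
            out.add(e.getKey());
        }
        return out;
    }
}
```

Running this once or twice a day over the recent queries and writing the
top ten words per prefix back into the index would approximate the
"currently popular" behaviour described above.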

