Mailing-List: contact users-help@jackrabbit.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@jackrabbit.apache.org
Received-SPF: pass (nike.apache.org: domain of a.schrijvers@1hippo.com
 designates 64.18.2.179 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <C913DEBD.31A5E%aklimets@adobe.com>
References: <AANLkTinhADkJWy5XmW-aggQGgM9yMo0R8tPERikOsdR_@mail.gmail.com>
	<C913DEBD.31A5E%aklimets@adobe.com>
Date: Thu, 25 Nov 2010 11:00:30 +0100
Message-ID: <AANLkTinFcwD6=oxGckmeM3tauVixcffQTOGk47hFmmnW@mail.gmail.com>
Subject: Re: AutoCompelete
From: Ard Schrijvers <a.schrijvers@onehippo.com>
To: users@jackrabbit.apache.org
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

On Thu, Nov 25, 2010 at 9:48 AM, Alexander Klimetschek
<aklimets@adobe.com> wrote:
> On 24.11.10 22:29, "Ard Schrijvers" <a.schrijvers@onehippo.com> wrote:
>
>>On Wed, Nov 24, 2010 at 10:03 PM, Zhou Wu <zwu_ca@yahoo.com> wrote:
>>> I'm trying to do something like
>>> org.apache.jackrabbit.core.query.lucene.spell.SpellChecker for
>>> autocomplete: when the user types in the search input box, a list of
>>> words (phrases) pops up, like Google suggestions. I searched the web
>>> and found
>>>
>>> http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
>>>
>>> which looks helpful. But I don't know how to start getting it to work
>>> with Jackrabbit. Could anyone give some tips? Thanks,
>>
>>Afaiu, Spellchecker wouldn't fit auto completion. Auto completion is
>>about suggesting existing terms in the index after you typed, say
>>'jack'.
>
> Exactly, spellcheck is about getting from "jeck" to "jack", but
> autocompletion (in its hardest form) is about getting from typing a "j"
> to a list like "jack, jupiter, jelly, january, ...".
>
> Also there are different use cases as what to show in auto-completion
> (always showing all possibilities doesn't work ;-)) and it is language-
> and region dependent.
>
> Since those few-letter inputs like "j" will be the most frequent ones, as
> people are typing words one-by-one, you want to look up those
> terms from a pre-built index as directly as possible. For this, you can
> have something like "j/ja/jac" in the repository. On each level there is a
> multi-value property containing the auto-completions/suggestions you want
> to show (10 is a good number for example, used by google).

Ah, you suggest manually keeping track of the 'auto-suggest' list,
right? Just read them all in once, have some observer for changes, et
voilà. That works, but I want to build it differently myself.
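
A minimal sketch of that manually maintained layout, assuming a plain
in-memory map as a stand-in for the repository nodes (no JCR API is used;
the class name, MAX_DEPTH and the already-ranked input list are
hypothetical, only the "j/ja/jac" paths and the limit of 10 come from
Alex's mail):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for the "j/ja/jac" node layout: each key is the path of
// a prefix node, each value plays the role of its multi-value property.
class PrefixSuggestions {
    static final int MAX_SUGGESTIONS = 10; // the "10 is a good number" above
    static final int MAX_DEPTH = 3;        // hypothetical cut-off: j, j/ja, j/ja/jac

    // Turns "jac" into the node path "j/ja/jac".
    static String pathFor(String prefix) {
        StringBuilder path = new StringBuilder();
        for (int i = 1; i <= prefix.length(); i++) {
            if (i > 1) path.append('/');
            path.append(prefix, 0, i);
        }
        return path.toString();
    }

    // rankedTerms is assumed to be ordered best-first (e.g. by popularity).
    static Map<String, List<String>> build(List<String> rankedTerms) {
        Map<String, List<String>> nodes = new LinkedHashMap<>();
        for (String term : rankedTerms) {
            int depth = Math.min(MAX_DEPTH, term.length());
            for (int i = 1; i <= depth; i++) {
                List<String> values = nodes.computeIfAbsent(
                        pathFor(term.substring(0, i)), k -> new ArrayList<>());
                if (values.size() < MAX_SUGGESTIONS) values.add(term);
            }
        }
        return nodes;
    }

    // Input longer than MAX_DEPTH falls back to the deepest node and
    // filters the values stored there.
    static List<String> suggest(Map<String, List<String>> nodes, String typed) {
        if (typed.isEmpty()) return Collections.emptyList();
        String key = typed.substring(0, Math.min(MAX_DEPTH, typed.length()));
        List<String> stored = nodes.getOrDefault(pathFor(key), Collections.emptyList());
        List<String> result = new ArrayList<>();
        for (String s : stored) {
            if (s.startsWith(typed)) result.add(s);
        }
        return result;
    }
}
```

With the ranked list jack, january, jelly, jupiter, the node "j" holds all
four terms and "j/ja" holds jack and january. In a real repository an
observer would have to rebuild the affected nodes when content changes.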

I want to deliver the feature for us in a different way: expose the
Lucene term enum as a virtual hierarchical node tree, where every node
is a single letter. This is very efficient, and easy to build once
virtual layers are up and running. The only thing I am still struggling
with is Lucene stemming: the term enum then contains stemmed
words. OTOH, imo, the complete stemming concept in Lucene has been
broken from the start; I never advise stemming. Removing diacritics is
enough. (Lucene 4.0 won't need stemming any more, as you can do
everything with fuzzy searches because of a new bleeding-edge
automaton query... first upgrade Jackrabbit, however :-))
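
Roughly what I have in mind, as a sketch only: an in-memory tree with one
letter per node, fed from a list of terms that stands in for the Lucene
term enum. Nothing here touches the Lucene or Jackrabbit API; LetterTree,
fold and complete are made-up names, and the terms are assumed to be
unstemmed, with diacritics removed on the way in:

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the "every node is a single letter" tree.
class LetterTree {
    static final class Node {
        // TreeMap keeps children sorted, like terms in a term enum.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isTerm; // true if the path from the root spells a full term
    }

    private final Node root = new Node();

    // Remove diacritics instead of stemming: decompose accented letters,
    // drop the combining marks, lower-case the rest.
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    void add(String term) {
        Node node = root;
        for (char c : fold(term).toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isTerm = true;
    }

    // Walk down one letter per level, then collect up to `limit` terms
    // found below that node, in sorted order.
    List<String> complete(String typed, int limit) {
        String prefix = fold(typed);
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return new ArrayList<>();
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), limit, out);
        return out;
    }

    private static void collect(Node node, StringBuilder path, int limit,
                                List<String> out) {
        if (out.size() >= limit) return;
        if (node.isTerm) out.add(path.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, limit, out);
            path.setLength(path.length() - 1); // step back up one letter
        }
    }
}
```

After adding "jack", "jackrabbit" and "jelly", complete("jack", 10) returns
jack and jackrabbit. If the term enum holds stemmed words, the tree would
return the stems, which is exactly the problem mentioned above.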

Regards Ard

>
> How this index is built the first time depends on the use case. For
> example, the Google search shows you terms that are currently popular, so
> they probably update that index based on query statistics, say once or
> twice a day. To start, you can use a dictionary, filter out stop words
> like "the", "and" etc. and build that index automatically. Then you only
> get single words - Google also shows full searches, like "jack wolfskin".
> And there are probably many other sources you can build such an index from.
>
> Hope that helps,
> Alex
>
> --
> Alexander Klimetschek
> Developer // Adobe (Day) // Berlin - Basel


-- 
Hippo
Europe • Amsterdam Oosteinde 11 • 1017 WT Amsterdam • +31 (0)20 522 4466
USA • San Francisco 185 H Street Suite B • Petaluma CA 94952-5100 • +1 (707) 773 4646
Canada • Montréal 5369 Boulevard St-Laurent • Montréal QC H2T 1S5 • +1 (514) 316 8966
www.onehippo.com • www.onehippo.org • info@onehippo.com