From: Ard Schrijvers
To: users@jackrabbit.apache.org
Subject: Re: AutoCompelete
Date: Thu, 25 Nov 2010 11:00:30 +0100

On Thu, Nov 25, 2010 at 9:48 AM, Alexander Klimetschek wrote:
> On 24.11.10 22:29, "Ard Schrijvers" wrote:
>
>>On Wed, Nov 24, 2010 at 10:03 PM, Zhou Wu wrote:
>>> I'm trying to do something like
>>> org.apache.jackrabbit.core.query.lucene.spell.SpellChecker for
>>> autocomplete: when the user types in the search input box, a list of
>>> words (phrases) pops up, like Google suggestions. I searched the web
>>> and found
>>> http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
>>> which looks helpful. But I don't know how to start getting it to work
>>> with Jackrabbit. Could anyone give some tips? Thanks,
>>
>>Afaiu, Spellchecker wouldn't fit auto completion. Auto completion is
>>about suggesting existing terms in the index after you typed, say
>>'jack'.
>
> Exactly, spellcheck is about getting from "jeck" to "jack", but
> autocompletion (in its hardest form) is about getting from typing a "j"
> to a list like "jack, jupiter, jelly, january, ...".
>
> Also there are different use cases as to what to show in auto-completion
> (always showing all possibilities doesn't work ;-)) and it is language-
> and region-dependent.
>
> Since those few-letter inputs like "j" will be the most frequent ones, as
> people are typing words one-by-one, you want to look up those
> terms from a pre-built index as directly as possible. For this, you can
> have something like "j/ja/jac" in the repository. On each level there is a
> multi-value property containing the auto-completions/suggestions you want
> to show (10 is a good number for example, used by google).

Ah, you suggest manually keeping track of the 'auto-suggest' list,
right? Just read them all in once, have some observer for changes, et
voilà.
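
To make the "j/ja/jac" layout quoted above concrete, here is a minimal sketch in Python. It is not Jackrabbit code: a plain dict stands in for the repository, the dict values stand in for the multi-value property on each level, and the names (prefix_path, build_suggestion_index, term_counts, MAX_DEPTH) are all hypothetical. The popularity counts are an assumed input, for example query statistics.

```python
from collections import defaultdict

STOP_WORDS = {"the", "and"}   # examples from the thread; extend as needed
MAX_SUGGESTIONS = 10          # the thread suggests 10 per level
MAX_DEPTH = 3                 # assumption: only index prefixes up to 3 letters


def prefix_path(prefix):
    """Turn 'jac' into the nested path 'j/ja/jac'."""
    return "/".join(prefix[:i] for i in range(1, len(prefix) + 1))


def build_suggestion_index(term_counts):
    """Map each prefix path to its most popular terms.

    term_counts: dict of term -> popularity (hypothetical input).
    """
    buckets = defaultdict(list)
    for term, count in term_counts.items():
        if term in STOP_WORDS:
            continue
        for depth in range(1, min(len(term), MAX_DEPTH) + 1):
            buckets[prefix_path(term[:depth])].append((count, term))

    index = {}
    for path, entries in buckets.items():
        # Most popular first; ties broken alphabetically for stable output.
        entries.sort(key=lambda entry: (-entry[0], entry[1]))
        index[path] = [term for _, term in entries[:MAX_SUGGESTIONS]]
    return index
```

In a real repository, each key would presumably become a node path and each list a multi-value property, refreshed by an observer or a periodic job as discussed in the thread.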
That works, but I want to deliver the feature for us in a different
way: expose the Lucene term enum as a virtual hierarchical node tree,
where every node is a single letter. This is very efficient, and easy
to build once virtual layers are up and running.

The only thing I am still struggling with is Lucene stemming: with
stemming enabled, the term enum contains stemmed words. OTOH, imo, the
whole stemming concept in Lucene has been broken from the start; I never
advise stemming. Removing diacritics is enough. (Lucene 4.0 won't need
stemming anymore, as you can do everything with fuzzy searches thanks to
a new bleeding-edge automaton query... first upgrade Jackrabbit, however
:-))

Regards Ard

>
> How this index is built the first time depends on the use case. For
> example, the Google search shows you terms that are currently popular, so
> they probably update that index based on query statistics like one or two
> times a day. To start, you can use a dictionary, filter out stop words
> like "the", "and" etc. and build that index automatically. Then you only
> get single words - Google also shows full searches, like "jack wolfskin".
> And there are probably many other sources you can build such an index from.
>
> Hope that helps,
> Alex
>
> --
> Alexander Klimetschek
> Developer // Adobe (Day) // Berlin - Basel
>

-- 
Hippo
Europe • Amsterdam Oosteinde 11 • 1017 WT Amsterdam • +31 (0)20 522 4466
USA • San Francisco 185 H Street Suite B • Petaluma CA 94952-5100 • +1 (707) 773 4646
Canada • Montréal 5369 Boulevard St-Laurent • Montréal QC H2T 1S5 • +1 (514) 316 8966
www.onehippo.com • www.onehippo.org • info@onehippo.com
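
The letter-per-node tree idea from the reply above can be illustrated with a small in-memory sketch in Python. This is only an analogy, not the virtual-layer implementation: a plain list of strings stands in for the Lucene term enum, and LetterNode, build_letter_tree and complete are hypothetical names. Diacritics are folded away rather than stemming the terms, following the preference stated in the mail.

```python
import unicodedata


def strip_diacritics(text):
    """Fold accented letters to their base form by dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class LetterNode:
    def __init__(self):
        self.children = {}     # single letter -> LetterNode
        self.is_term = False   # True if the path to this node spells a term


def build_letter_tree(terms):
    """Build the tree from an iterable of terms (stand-in for a term enum)."""
    root = LetterNode()
    for term in terms:
        node = root
        for letter in strip_diacritics(term.lower()):
            node = node.children.setdefault(letter, LetterNode())
        node.is_term = True
    return root


def complete(root, prefix, limit=10):
    """Return up to `limit` terms starting with `prefix`, in alphabetical order."""
    prefix = strip_diacritics(prefix.lower())
    node = root
    for letter in prefix:
        node = node.children.get(letter)
        if node is None:
            return []
    results = []
    stack = [(node, prefix)]
    while stack and len(results) < limit:
        current, word = stack.pop()
        if current.is_term:
            results.append(word)
        # Push children in reverse order so the smallest letter is visited first.
        for letter in sorted(current.children, reverse=True):
            stack.append((current.children[letter], word + letter))
    return results
```

Walking one level down the tree per typed letter is what makes the lookup cheap: the cost depends on the prefix length and the number of suggestions returned, not on the total number of terms.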