From: "Cam Bazz"
To: java-user@lucene.apache.org
Date: Thu, 14 Feb 2008 12:25:17 +0200
Subject: Re: matching products with suggest feature

Hello Shai,

That's right, Speller is in contrib; it is named spellchecker.
Basically, it is a special index that stores the words as ngrams.

I looked at the code to see how it queries the index. Basically, it
makes ngrams of the word and adds each ngram to a boolean query.

Here is how it adds to the boolean query. I could not find out whether
it is AND or OR.

Best.

    // Wraps the term in a TermQuery, applies the boost, and adds it
    // to the query as a SHOULD clause.
    private static void add(BooleanQuery q, String name, String value, float boost) {
        Query tq = new TermQuery(new Term(name, value));
        tq.setBoost(boost);
        q.add(new BooleanClause(tq, BooleanClause.Occur.SHOULD));
    }

    // Same as above, without a boost.
    private static void add(BooleanQuery q, String name, String value) {
        q.add(new BooleanClause(new TermQuery(new Term(name, value)),
                BooleanClause.Occur.SHOULD));
    }

On Thu, Feb 14, 2008 at 8:44 AM, Shai Erera wrote:

> Is this Speller class a Lucene class? I didn't find it in the main code
> stream, maybe it's part of contrib?
>
> Anyway, it still depends on how it is implemented (OR or AND). For
> example, someone indexed a document with the word "abcde" and the index
> keeps the ngrams "abc", "bcd" and "cde". Then somebody types in "abc":
> what would the speller suggest? What would the speller suggest for
> "abce"?
> If it works in OR mode, I assume it would suggest "abcde" for both, as
> "abc" appears in both. But if it works in AND mode, then for the first
> it will suggest "abcde", but for the second it won't, because the
> ngrams produced are "abc" and "bce", and "bce" does not appear in
> "abcde".
>
> Am I right? If not, can you elaborate more on the Speller class you use?
>
> On Wed, Feb 13, 2008 at 8:19 PM, Cam Bazz wrote:
>
> > Hello Shai,
> >
> > The class that does the matching is Speller.
> > It is not query based; rather, there is a method called
> > suggestSimilar(String word, int numSug), where numSug is the number
> > of suggestions. The words are kept in the index as ngrams. For
> > example, abcde is kept as abc bcd cde.
> > So this is not a normal query like the ones we all know.
> >
> > Best regards,
> > C.B.
> >
> >
> > On Feb 13, 2008 7:00 PM, Shai Erera wrote:
> >
> > > What is the default operator of your QueryParser? Is it
> > > AND_OPERATOR or OR_OPERATOR? If it's OR, then it's strange. If
> > > it's AND, then once you add more terms than what exist, it won't
> > > find anything.
> > >
> > > On Feb 13, 2008 6:54 PM, Cam Bazz wrote:
> > >
> > > > Hello;
> > > >
> > > > I am trying to make a product matcher based on Lucene's ngram
> > > > based suggest.
> > > > I made some changes so that instead of giving the speller a
> > > > dictionary, I feed it with a List.
> > > >
> > > > For example, let's say I have
> > > > "HP NC4400 EY605EA CORE 2 DUO T5600 1.83GHz/512MB/80GB/12.1'' NOTEBOOK"
> > > > and I index it with the speller using an ngram approach.
> > > >
> > > > It works quite well when using the suggest feature. For example,
> > > > if the user submits something similar (similar as in the string
> > > > length is roughly equal, and a word or two might be mistyped or
> > > > even missing), Lucene finds it.
> > > > However, when the user submits the same product with a much
> > > > shorter or much longer string, for example "HP NC4400 EY605EA" or
> > > > "HP NC4400 EY605EA CORE 2 DUO T5600 1.83GHz/512MB/80GB/12.1''
> > > > NOTEBOOK WITH WINDOWS XP AND GIFT MOUSE", the suggester won't
> > > > work.
> > > >
> > > > I am not sure about the ngram approach any more.
> > > >
> > > > Any ideas/recommendations/help greatly appreciated.
> > > >
> > > > Best Regards,
> > > > C.B.
> > >
> > > --
> > > Regards,
> > >
> > > Shai Erera
>
> --
> Regards,
>
> Shai Erera
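
A minimal sketch of the OR versus AND question raised in the thread, in
plain Java with no Lucene dependency. The class NgramMatchSketch and its
methods ngrams, matchesAny, and matchesAll are hypothetical names made up
for illustration; they are not part of Lucene or the contrib
spellchecker, and the fixed ngram size of 3 is an assumption taken from
the "abcde" example. It only shows how the two modes would treat the
queries "abc" and "abce" against an indexed "abcde"; it makes no claim
about how the spellchecker itself scores or filters suggestions.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
class NgramMatchSketch {

    // Split a word into overlapping character ngrams of size n,
    // e.g. "abcde" with n = 3 gives abc, bcd, cde.
    static Set<String> ngrams(String word, int n) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // OR mode: at least one ngram of the query occurs in the indexed word.
    static boolean matchesAny(String indexed, String query, int n) {
        Set<String> indexedGrams = ngrams(indexed, n);
        for (String gram : ngrams(query, n)) {
            if (indexedGrams.contains(gram)) {
                return true;
            }
        }
        return false;
    }

    // AND mode: every ngram of the query occurs in the indexed word.
    static boolean matchesAll(String indexed, String query, int n) {
        Set<String> queryGrams = ngrams(query, n);
        return !queryGrams.isEmpty() && ngrams(indexed, n).containsAll(queryGrams);
    }

    public static void main(String[] args) {
        String indexed = "abcde";
        for (String query : new String[] {"abc", "abce"}) {
            System.out.println(query
                    + " OR=" + matchesAny(indexed, query, 3)
                    + " AND=" + matchesAll(indexed, query, 3));
        }
        // prints:
        // abc OR=true AND=true
        // abce OR=true AND=false
    }
}
```

Under these assumptions, "abce" still matches "abcde" in OR mode because
the two share the ngram "abc", but it fails in AND mode because "bce" is
not among the indexed ngrams.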