lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cam Bazz" <camb...@gmail.com>
Subject Re: matching products with suggest feature
Date Thu, 14 Feb 2008 10:25:17 GMT
Hello Shai,

Thats right, Speller is in the contrib.it is named spellchecker. Basically
it is a special index that stores the words as ngrams.
I looked at the code to see how it is querying the index and basically it
makes ngrams and adds each ngram to a boolean query.

Here is how it adds to the boolean query. I could not find out whether it is
AND or OR

Best.

  private static void add(BooleanQuery q, String name, String value, float
boost) {
    Query tq = new TermQuery(new Term(name, value));
    tq.setBoost(boost);
    q.add(new BooleanClause(tq, BooleanClause.Occur.SHOULD));
  }

  private static void add(BooleanQuery q, String name, String value) {
    q.add(new BooleanClause(new TermQuery(new Term(name, value)),
BooleanClause.Occur.SHOULD));
  }

On Thu, Feb 14, 2008 at 8:44 AM, Shai Erera <serera@gmail.com> wrote:

> Is this Speller class a Lucene class? I didn't find it in the main code
> stream, maybe it's part of contrib?
>
> Anyway, still it depends how it is implemented (OR or AND). For example,
> someone indexed a document with the word "abcde" and the index keeps the
> ngrams "abc", "bcd" and "cde". Then somebody types in "abc", what would
> the
> speller suggest? What would the speller suggest for "abce"?
> If it works in an OR mode, I assume it would suggest "abcde" for both, as
> "abc" appears in both. But if it works in AND mode, then for the first it
> will suggest "abcde" but for the second it won't suggest it because the
> ngrams produced are "abc" and "bce" .. and "bce" does not appear in
> "abcde".
>
> Am I right? If not, can you elaborate more on the Speller class you use?
>
> On Wed, Feb 13, 2008 at 8:19 PM, Cam Bazz <cambazz@gmail.com> wrote:
>
> > Hello Shai,
> >
> > The class that does the matching is Speller.
> > It does not work query based but rather there is a method called -
> > suggestSimilar(String word, int numSug); where the numSug is number of
> > suggestions. The words are kept in the index as ngrams. For example
> abcde
> > is
> > kept as abc bcd cde.
> > So this is not normal query like we all know.
> >
> > Best regards,
> > C.B.
> >
> >
> > On Feb 13, 2008 7:00 PM, Shai Erera <serera@gmail.com> wrote:
> >
> > > What is the default Operator of your QueryParser? Is it AND_OPERATOR
> or
> > > OR_OPERATOR. If it's OR ... then it's strange. If it's AND, then once
> > you
> > > add more terms than what exists, it won't find anything.
> > >
> > > On Feb 13, 2008 6:54 PM, Cam Bazz <cambazz@gmail.com> wrote:
> > >
> > > > Hello;
> > > >
> > > > I am trying to make a product matcher based on lucene's ngram based
> > > > suggest.
> > > > I did some changes so that instead of giving the speller a
> dictionary
> > I
> > > > feed
> > > > it with a List<String>.
> > > >
> > > > For example lets say I have "HP NC4400 EY605EA CORE 2 DUO T5600
> > > > 1.83GHz/512MB/80GB/12.1''
> > > > NOTEBOOK"
> > > > and I index it with speller using an ngram approach.
> > > >
> > > > It works quite well - when using the suggest feature, for example if
> > the
> > > > user submits something similar. similar as in the string lenght is
> > > > relatively equal, a word or two might be mistyped - or even missing,
> > > > lucene
> > > > finds it.
> > > > However - when the user submits the same product - but with much
> less
> > or
> > > > much more string length - for example "HP NC4400 EY605EA" or "HP
> > NC4400
> > > > EY605EA CORE 2 DUO T5600 1.83GHz/512MB/80GB/12.1'' NOTEBOOK WITH
> > WINDOWS
> > > > XP
> > > > AND GIFT MOUSE" - the suggester wont work.
> > > >
> > > > I am not sure about the ngrams approach any more.
> > > >
> > > > Any ideas/recomendations/help greatly appreciated.
> > > >
> > > > Best Regards,
> > > > C.B.
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Shai Erera
> > >
> >
>
>
>
> --
> Regards,
>
> Shai Erera
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message