lucene-java-user mailing list archives

From Samuel García Martínez <samuelgmarti...@gmail.com>
Subject Re: possible bug on Spellchecker
Date Thu, 21 Feb 2013 20:02:02 GMT
Yes, of course I can. I'll try to open it tonight (European time) or
tomorrow as soon as I get to the office.


On Thu, Feb 21, 2013 at 4:14 PM, Dyer, James
<James.Dyer@ingramcontent.com> wrote:

> Samuel,
>
> Do you think you could write a failing unit test and open a JIRA issue?
>  Or at the least open a JIRA issue with all the details without a test?
>
> James Dyer
> Ingram Content Group
> (615) 213-4311
>
>
> -----Original Message-----
> From: Samuel García Martínez [mailto:samuelgmartinez@gmail.com]
> Sent: Thursday, February 21, 2013 2:33 AM
> To: java-user@lucene.apache.org
> Subject: Re: possible bug on Spellchecker
> Importance: Low
>
> I'm using Solr 3.6, and DirectSpellChecker is only available in v4+.
> Moreover, for "big" indexes I prefer using a sidekick index rather than
> iterating over the term dictionary.
>
>
> On Thu, Feb 21, 2013 at 8:19 AM, Jack Krupansky <jack@basetechnology.com>
> wrote:
>
> > Any reason that you are not using the DirectSpellChecker?
> >
> > See:
> > http://lucene.apache.org/core/4_0_0/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html
> >
> > -- Jack Krupansky
> >
> > -----Original Message-----
> > From: Samuel García Martínez
> > Sent: Wednesday, February 20, 2013 3:34 PM
> > To: java-user@lucene.apache.org
> > Subject: possible bug on Spellchecker
> >
> >
> > Hi all,
> >
> > While debugging the Solr spellchecker (IndexBasedSpellChecker, which
> > delegates to the Lucene SpellChecker), I think I found a bug when the
> > input is a 6-letter word:
> >  - george
> >  - anthem
> >  - argued
> >  - fluent
> >
> > Due to getMin() and getMax(), the grams indexed for these terms are of
> > sizes 3 and 4. So the fields would be something like this:
> >  - for "*george*"
> >
> >     - start3: "geo"
> >     - start4: "geor"
> >     - end3: "rge"
> >     - end4: "orge"
> >     - 3: "geo", "eor", "org", "rge"
> >     - 4: "geor", "eorg", "orge"
> >  - for "*anthem*"
> >
> >     - start3: "ant"
> >     - start4: "anth"
> >     - end3: "hem"
> >     - end4: "them"
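
The field layout listed above can be reproduced with a short sketch. It is written in Python rather than Lucene's Java, and `gram_fields` is a hypothetical helper, not a Lucene API. The field labels (`start3`, `end3`, `3`, and so on) simply follow the ones used in this message, and the gram sizes are passed in explicitly instead of being derived from getMin() and getMax().

```python
def gram_fields(word, n_min, n_max):
    """Return the n-gram fields for `word`, labelled as in the message above.

    Hypothetical helper for illustration only; not part of Lucene.
    """
    fields = {}
    for n in range(n_min, n_max + 1):
        # All contiguous substrings of length n, left to right.
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        if not grams:
            continue  # word is shorter than n: nothing to index for this size
        fields[f"start{n}"] = grams[0]   # leading gram
        fields[f"end{n}"] = grams[-1]    # trailing gram
        fields[str(n)] = grams           # every gram of this size
    return fields


if __name__ == "__main__":
    # Gram sizes 3 and 4 are the ones the message reports for 6-letter words.
    for name, value in gram_fields("george", 3, 4).items():
        print(name, value)
```

Calling `gram_fields("anthem", 3, 4)` in the same way gives the start and end grams listed for "anthem".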
> >
> > The problem shows up when the user swaps the 3rd and 4th characters,
> > misspelling the word like this:
> >  - geroge
> >  - anhtem
> >
> > The queries generated for these terms are (boolean queries with SHOULD clauses):
> > - for "*geroge*"
> >
> >  - start3: "ger"
> >  - start4: "gero"
> >  - end3: "oge"
> >  - end4: "roge"
> >  - 3: "ger", "ero", "rog", "oge"
> >  - 4: "gero", "erog", "roge"
> > - for "*anhtem*"
> >
> >  - start3: "anh"
> >  - start4: "anht"
> >  - end3: "tem"
> >  - end4: "htem"
> >  - 3: "anh", "nht", "hte", "tem"
> >  - 4: "anht", "nhte", "htem"
> >
> > So, as you can see, this kind of misspelling never matches the suitable
> > suggestions, because none of the query grams occur in the indexed
> > fields, even though the string distance score is 0.95555556.
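
A minimal sketch of that check, under the same assumptions as the field lists above (gram sizes 3 and 4, field labels as used in this message). It is Python, not Lucene code, and `field_terms` and `shared_terms` are hypothetical helpers. It only compares the (field, gram) pairs of the correct word with those of the misspelled one; it does not run a real Lucene query.

```python
def field_terms(word, n_min, n_max):
    """Set of (field, gram) pairs for `word`; hypothetical helper, not Lucene code."""
    terms = set()
    for n in range(n_min, n_max + 1):
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        if not grams:
            continue  # word shorter than n
        terms.add((f"start{n}", grams[0]))
        terms.add((f"end{n}", grams[-1]))
        terms.update((str(n), g) for g in grams)
    return terms


def shared_terms(indexed_word, query_word, n_min, n_max):
    """Pairs that the indexed word and the query word have in common."""
    return field_terms(indexed_word, n_min, n_max) & field_terms(query_word, n_min, n_max)


if __name__ == "__main__":
    for good, typo in [("george", "geroge"), ("anthem", "anhtem")]:
        print(good, typo, shared_terms(good, typo, 3, 4))
```

For both pairs the intersection is empty, which is consistent with the observation that no SHOULD clause can match.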
> >
> > I think getMin(int l) and getMax(int l) should return 2 and 3,
> > respectively, for l == 6. Debugging other lengths, I did not find any
> > problem with any kind of misspelling.
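
To illustrate what the proposed sizes would change, here is a rough Python sketch (not Lucene code). `overlap_for_sizes` is a hypothetical helper, and the field labels `start2`, `end2`, `2` follow the naming pattern used in this message. It lists the (field, gram) pairs a 6-letter word and its transposed misspelling would share if grams of sizes 2 and 3 were used. It says nothing about how Lucene would score or rank the resulting suggestion.

```python
def overlap_for_sizes(word, typo, n_min, n_max):
    """(field, gram) pairs shared by `word` and `typo` for the given gram sizes.

    Hypothetical helper for illustration only; not part of Lucene.
    """
    def terms(w):
        out = set()
        for n in range(n_min, n_max + 1):
            grams = [w[i:i + n] for i in range(len(w) - n + 1)]
            if grams:
                out.add((f"start{n}", grams[0]))
                out.add((f"end{n}", grams[-1]))
                out.update((str(n), g) for g in grams)
        return out

    return terms(word) & terms(typo)


if __name__ == "__main__":
    # Sizes 2 and 3 are the values proposed in the message for l == 6.
    for good, typo in [("george", "geroge"), ("anthem", "anhtem")]:
        print(good, typo, sorted(overlap_for_sizes(good, typo, 2, 3)))
```

With sizes 2 and 3, the 2-gram fields give the query something to match (for example the leading and trailing grams of "george" and "geroge"), whereas the 3-grams still share nothing.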
> >
> > Any thoughts about this?
> >
> > --
> > Un saludo,
> > Samuel García
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
> --
> Un saludo,
> Samuel García.
>
>
>
>


-- 
Un saludo,
Samuel García.
