lucene-java-user mailing list archives

From "Dyer, James" <James.D...@ingramcontent.com>
Subject RE: possible bug on Spellchecker
Date Thu, 21 Feb 2013 15:14:18 GMT
Samuel,

Do you think you could write a failing unit test and open a JIRA issue?  Or at the least open
a JIRA issue with all the details without a test?

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Samuel García Martínez [mailto:samuelgmartinez@gmail.com] 
Sent: Thursday, February 21, 2013 2:33 AM
To: java-user@lucene.apache.org
Subject: Re: possible bug on Spellchecker
Importance: Low

I'm using Solr 3.6, and DirectSpellChecker is available only in v4+.
Moreover, for "big" indexes I prefer using a sidekick index rather than
iterating over the term dictionary.


On Thu, Feb 21, 2013 at 8:19 AM, Jack Krupansky <jack@basetechnology.com> wrote:

> Any reason that you are not using the DirectSpellChecker?
>
> See:
> http://lucene.apache.org/core/4_0_0/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Samuel García Martínez
> Sent: Wednesday, February 20, 2013 3:34 PM
> To: java-user@lucene.apache.org
> Subject: possible bug on Spellchecker
>
>
> Hi all,
>
> While debugging the behaviour of the Solr spellchecker (IndexBasedSpellChecker,
> which delegates to the Lucene SpellChecker), I think I found a bug when the
> input is a 6-letter word:
>  - george
>  - anthem
>  - argued
>  - fluent
>
> Because of getMin() and getMax(), the gram sizes indexed for these terms are 3
> and 4. So, the fields would be something like this:
>  - for "*george*"
>
>     - start3: "geo"
>     - start4: "geor"
>     - end3: "rge"
>     - end4: "orge"
>     - 3: "geo", "eor", "org", "rge"
>     - 4: "geor", "eorg", "orge"
>  - for "*anthem*"
>
>     - start3: "ant"
>     - start4: "anth"
>     - end3: "hem"
>     - end4: "them"
>
> The problem shows up when the user swaps the 3rd and 4th characters, misspelling
> the word like this:
>  - geroge
>  - anhtem
>
> The queries generated for these terms are (SHOULD boolean queries):
> - for "*geroge*"
>
>  - start3: "ger"
>  - start4: "gero"
>  - end3: "oge"
>  - end4: "roge"
>  - 3: "ger", "ero", "rog", "oge"
>  - 4: "gero", "erog", "roge"
> - for "*anhtem*"
>
>  - start3: "anh"
>  - start4: "anht"
>  - end3: "tem"
>  - end4: "htem"
>  - 3: "anh", "nht", "hte", "tem"
>  - 4: "anht", "nhte", "htem"
>
> So, as you can see, this kind of misspelling never matches the suitable
> suggestions, because the misspelled word shares no gram with the correct
> one, although the string similarity score is 0.95555556.
>
> I think getMin(int l) and getMax(int l) should return 2 and 3,
> respectively, for l==6. Debugging other values, I did not find any problem
> with any kind of misspelling.
>
> Any thoughts about this?
>
> --
> Un saludo,
> Samuel García
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

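The gram analysis in the quoted message can be checked with a small standalone sketch. It does not call Lucene: GramOverlap, grams() and sharedGrams() are hypothetical helper names, and the gram sizes (3 and 4 today, 2 and 3 as proposed) are taken from the message rather than from the SpellChecker source. Because start3/start4 and end3/end4 are just the first and last grams of each size, an empty shared set also means those fields cannot match.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: GramOverlap and its methods are hypothetical helpers,
// not part of Lucene's SpellChecker API.
class GramOverlap {

    // All character n-grams of the given size, in order of first occurrence.
    static Set<String> grams(String word, int size) {
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i + size <= word.length(); i++) {
            out.add(word.substring(i, i + size));
        }
        return out;
    }

    // Grams of every size in [min, max] that occur in both words.
    static Set<String> sharedGrams(String a, String b, int min, int max) {
        Set<String> shared = new LinkedHashSet<>();
        for (int size = min; size <= max; size++) {
            Set<String> common = grams(a, size);
            common.retainAll(grams(b, size));
            shared.addAll(common);
        }
        return shared;
    }

    public static void main(String[] args) {
        String[][] pairs = {{"george", "geroge"}, {"anthem", "anhtem"}};
        for (String[] p : pairs) {
            System.out.println(p[0] + " / " + p[1]
                + "  sizes 3-4: " + sharedGrams(p[0], p[1], 3, 4)
                + "  sizes 2-3: " + sharedGrams(p[0], p[1], 2, 3));
        }
    }
}
```

For both pairs the shared set is empty with sizes 3 and 4. With sizes 2 and 3, george/geroge share "ge" and anthem/anhtem share "an" and "em", which is why the proposed change would let these suggestions be retrieved.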

-- 
Un saludo,
Samuel García.



