Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Thu, 21 Feb 2013 21:30:13 +0000 (UTC)
From: "Samuel García Martínez (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <JIRA.12633472.1361482001167.319618.1361482213113@arcas>
In-Reply-To: <JIRA.12633472.1361482001167@arcas>
References: <JIRA.12633472.1361482001167@arcas>
Subject: [jira] [Updated] (LUCENE-4793) Spellchecker don't find suggestion
 for concrete misspelled 6 letter words
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


     [ https://issues.apache.org/jira/browse/LUCENE-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Samuel García Martínez updated LUCENE-4793:
-------------------------------------------

    Description:
While debugging the Solr spellchecker (IndexBasedSpellChecker, which delegates to the Lucene SpellChecker), I think I found a bug that occurs when the input is a 6-letter word:
  - george
  - anthem
  - argued
  - fluent

Because of the values returned by getMin() and getMax(), the n-grams indexed for these terms have lengths 3 and 4. So the indexed fields would look something like this:
  - for "*george*"
     -- start3: "geo"
     -- start4: "geor"
     -- end3: "rge"
     -- end4: "orge"
     -- 3: "geo", "eor", "org", "rge"
     -- 4: "geor", "eorg", "orge"
  - for "*anthem*"
     -- start3: "ant"
     -- start4: "anth"
     -- end3: "hem"
     -- end4: "them"
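
The field values listed above can be reproduced with a few lines of Java. This is only a minimal sketch of the idea, not the SpellChecker source: the class name GramSketch and the helper grams() are hypothetical, and the bounds 3 and 4 are simply the values the report gives for 6-letter words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Lucene.
class GramSketch {
    // All contiguous substrings of length n, in order of appearance.
    static List<String> grams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        String word = "george";
        // 3 and 4 are the gram lengths the report states for 6-letter words.
        for (int n = 3; n <= 4; n++) {
            List<String> g = grams(word, n);
            System.out.println("start" + n + ": " + g.get(0));
            System.out.println("end" + n + ": " + g.get(g.size() - 1));
            System.out.println(n + ": " + g);
        }
    }
}
```

The "startN" and "endN" values are just the first and last gram of each length, which is why a change near the middle of a short word can alter every gram at once.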

The problem shows up when the user swaps the 3rd and 4th characters, misspelling the word like this:
  - geroge
  - anhtem

The queries generated for these terms are (as boolean queries with SHOULD clauses):
- for "*geroge*"
  -- start3: "ger"
  -- start4: "gero"
  -- end3: "oge"
  -- end4: "roge"
  -- 3: "ger", "ero", "rog", "oge"
  -- 4: "gero", "erog", "roge"
- for "*anhtem*"
  -- start3: "anh"
  -- start4: "anht"
  -- end3: "tem"
  -- end4: "htem"
  -- 3: "anh", "nht", "hte", "tem"
  -- 4: "anht", "nhte", "htem"
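
To check the lists above mechanically, the sketch below compares the grams of an indexed word with those of the misspelled query, field by field. It is an illustration under assumptions: FieldMatchSketch, grams() and fieldMatches() are hypothetical names, and the matching rule (a SHOULD clause hits when the same gram appears in the same field) is a simplification of how the real query is scored.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, not part of Lucene.
class FieldMatchSketch {
    // All contiguous substrings of length n, in order of appearance.
    static List<String> grams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // For each field name used in the report, records whether the query word
    // shares a value with the indexed word. Assumes both words have at least
    // max characters.
    static Map<String, Boolean> fieldMatches(String indexed, String query, int min, int max) {
        Map<String, Boolean> out = new LinkedHashMap<>();
        for (int n = min; n <= max; n++) {
            List<String> gi = grams(indexed, n);
            List<String> gq = grams(query, n);
            out.put("start" + n, gi.get(0).equals(gq.get(0)));
            out.put("end" + n, gi.get(gi.size() - 1).equals(gq.get(gq.size() - 1)));
            Set<String> shared = new HashSet<>(gi);
            shared.retainAll(gq);
            out.put(String.valueOf(n), !shared.isEmpty());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fieldMatches("george", "geroge", 3, 4));
        System.out.println(fieldMatches("anthem", "anhtem", 3, 4));
    }
}
```

For both word pairs every field comes back false, so none of the SHOULD clauses can hit the correctly spelled term.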

So, as you can see, this kind of misspelling never matches the suitable suggestions, although the similarity between the two words (reported as the edit distance) is 0.95555556.
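
The report does not say which distance measure produced 0.95555556. As an assumption only, the textbook Jaro-Winkler formula (prefix scale 0.1, prefix capped at 4 characters) gives a value that rounds to the same figure for "george" versus "geroge". The sketch below is a plain implementation of that formula, not Lucene code; JaroWinklerSketch and its methods are hypothetical names, and the boost threshold some implementations apply is omitted.

```java
// Hypothetical helper class, not part of Lucene.
class JaroWinklerSketch {
    static double jaro(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        // Characters count as matching only within this distance of each other.
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int halfTranspositions = 0;
        int j = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) continue;
            while (!bMatched[j]) j++;
            if (a.charAt(i) != b.charAt(j)) halfTranspositions++;
            j++;
        }
        double m = matches;
        double t = halfTranspositions / 2.0;
        return (m / a.length() + m / b.length() + (m - t) / m) / 3.0;
    }

    static double jaroWinkler(String a, String b) {
        double j = jaro(a, b);
        // Length of the common prefix, capped at 4 characters.
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("george", "geroge"));
        System.out.println(jaroWinkler("anthem", "anhtem"));
    }
}
```

Either way, the two words are close by any reasonable measure, which is what makes the missing suggestion surprising.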

I think getMin(int l) and getMax(int l) should return 2 and 3, respectively, for l==6. Debugging other lengths, I did not find a problem with any kind of misspelling.
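
A rough way to see the effect of the proposed bounds is to compare the grams shared by the correct and misspelled words under both settings. This is a sketch under assumptions, not a patch: BoundsComparisonSketch and sharedGrams() are hypothetical names, and the bounds are passed in directly rather than taken from the real getMin() and getMax().

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, not part of Lucene.
class BoundsComparisonSketch {
    // Distinct substrings of every length from min to max, inclusive.
    static Set<String> allGrams(String word, int min, int max) {
        Set<String> out = new TreeSet<>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                out.add(word.substring(i, i + n));
            }
        }
        return out;
    }

    // Grams that the indexed word and the query word have in common.
    static Set<String> sharedGrams(String indexed, String query, int min, int max) {
        Set<String> shared = allGrams(indexed, min, max);
        shared.retainAll(allGrams(query, min, max));
        return shared;
    }

    public static void main(String[] args) {
        String[][] pairs = { { "george", "geroge" }, { "anthem", "anhtem" } };
        for (String[] p : pairs) {
            // Bounds described in the report for 6-letter words.
            System.out.println(p[0] + "/" + p[1] + " with 3..4: " + sharedGrams(p[0], p[1], 3, 4));
            // Bounds proposed in the report.
            System.out.println(p[0] + "/" + p[1] + " with 2..3: " + sharedGrams(p[0], p[1], 2, 3));
        }
    }
}
```

With bounds 2 and 3 each pair shares at least one 2-gram ("ge" for george, "an" and "em" for anthem), so the generated query would have something to match on.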

> Spellchecker don't find suggestion for concrete misspelled 6 letter words
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-4793
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4793
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/spellchecker
>    Affects Versions: 3.6, 4.0, 4.1
>            Reporter: Samuel García Martínez
>            Priority: Minor

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org