lucene-java-user mailing list archives

From Samuel García Martínez <samuelgmarti...@gmail.com>
Subject possible bug on Spellchecker
Date Wed, 20 Feb 2013 23:34:43 GMT
Hi all,

While debugging the Solr spellchecker (IndexBasedSpellChecker, which
delegates to Lucene's SpellChecker), I think I found a bug that shows up
when the input is a six-letter word, for example:
  - george
  - anthem
  - argued
  - fluent

Because of getMin() and getMax(), the grams indexed for these terms have
sizes 3 and 4, so the indexed fields look something like this:
  - for "george"
     - start3: "geo"
     - start4: "geor"
     - end3: "rge"
     - end4: "orge"
     - 3: "geo", "eor", "org", "rge"
     - 4: "geor", "eorg", "orge"
  - for "anthem"
     - start3: "ant"
     - start4: "anth"
     - end3: "hem"
     - end4: "them"
     - 3: "ant", "nth", "the", "hem"
     - 4: "anth", "nthe", "them"
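
To make the field layout above easy to reproduce, here is a minimal sketch in plain Java. It is not the Lucene code: the class GramFieldsSketch and the helper gramFields are hypothetical names, the gram sizes are passed in explicitly instead of coming from getMin()/getMax(), and the field names simply follow the notation used in this message.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GramFieldsSketch {
    // Hypothetical helper: builds the startN / endN / N fields listed above
    // for every gram size between min and max (inclusive).
    static Map<String, List<String>> gramFields(String word, int min, int max) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (int n = min; n <= max; n++) {
            if (word.length() < n) {
                continue; // word too short to contain a gram of this size
            }
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + n <= word.length(); i++) {
                grams.add(word.substring(i, i + n));
            }
            fields.put("start" + n, Collections.singletonList(grams.get(0)));
            fields.put("end" + n, Collections.singletonList(grams.get(grams.size() - 1)));
            fields.put(String.valueOf(n), grams);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sizes 3 and 4 are the values reported above for six-letter words.
        System.out.println(gramFields("george", 3, 4));
        System.out.println(gramFields("anthem", 3, 4));
    }
}
```

Running main prints one map per word, which can be compared line by line with the lists above.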

The problem shows up when the user swaps the 3rd and 4th characters,
misspelling the words like this:
  - geroge
  - anhtem

The queries generated for these terms are (boolean queries with SHOULD clauses):
- for "geroge"
  - start3: "ger"
  - start4: "gero"
  - end3: "oge"
  - end4: "roge"
  - 3: "ger", "ero", "rog", "oge"
  - 4: "gero", "erog", "roge"
- for "anhtem"
  - start3: "anh"
  - start4: "anht"
  - end3: "tem"
  - end4: "htem"
  - 3: "anh", "nht", "hte", "tem"
  - 4: "anht", "nhte", "htem"
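
The lack of overlap between the indexed terms and the query terms can be checked mechanically. The following is a rough sketch under the same assumptions as the lists above (gram sizes 3 and 4, one term per field/gram pair); GramOverlapSketch, fieldTerms and shared are hypothetical names, and no Lucene classes are involved.

```java
import java.util.HashSet;
import java.util.Set;

class GramOverlapSketch {
    // Collects "field:gram" terms for a word, mirroring the startN / endN / N
    // fields used in this message.
    static Set<String> fieldTerms(String word, int min, int max) {
        Set<String> terms = new HashSet<>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                String gram = word.substring(i, i + n);
                terms.add(n + ":" + gram);
                if (i == 0) {
                    terms.add("start" + n + ":" + gram);
                }
                if (i + n == word.length()) {
                    terms.add("end" + n + ":" + gram);
                }
            }
        }
        return terms;
    }

    // Terms that a SHOULD query built from `query` has in common with the
    // terms indexed for `indexed`. An empty result means no clause can match.
    static Set<String> shared(String indexed, String query, int min, int max) {
        Set<String> common = fieldTerms(indexed, min, max);
        common.retainAll(fieldTerms(query, min, max));
        return common;
    }

    public static void main(String[] args) {
        System.out.println(shared("george", "geroge", 3, 4));
        System.out.println(shared("anthem", "anhtem", 3, 4));
    }
}
```

With sizes 3 and 4, both calls return an empty set, so none of the SHOULD clauses can select the correctly spelled word as a candidate.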

So, as you can see, this kind of misspelling never matches the suitable
suggestion, because the query shares no gram with the indexed word, even
though the string distance score between the two words is 0.95555556.
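
For reference, the score quoted above is what a Jaro-Winkler similarity gives for these pairs. The sketch below is a textbook Jaro-Winkler written from scratch, not Lucene's implementation, and it assumes that Jaro-Winkler is the measure configured in this setup (the message does not say so). The prefix weight of 0.1 and the prefix cap of 4 are the usual textbook constants.

```java
class JaroWinklerSketch {
    // Plain Jaro similarity: counts characters that match within a sliding
    // window and penalises matched characters that appear in a different order.
    static double jaro(String a, String b) {
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        int outOfOrder = 0; // matched characters that differ position-wise
        int j = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) {
                continue;
            }
            while (!bMatched[j]) {
                j++;
            }
            if (a.charAt(i) != b.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        double m = matches;
        double t = outOfOrder / 2.0; // transpositions
        return (m / a.length() + m / b.length() + (m - t) / m) / 3.0;
    }

    // Winkler adjustment: boosts the score for a shared prefix of up to 4 chars.
    static double jaroWinkler(String a, String b) {
        double j = jaro(a, b);
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("george", "geroge"));
        System.out.println(jaroWinkler("anthem", "anhtem"));
    }
}
```

Both pairs have six matching characters, one transposition and a two-letter common prefix, which works out to roughly 0.9556. The words would therefore rank as very close if they ever reached the scoring step.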

I think getMin(int l) and getMax(int l) should return 2 and 3,
respectively, for l == 6. Debugging other lengths, I did not find any
problem with any kind of misspelling.
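
A quick way to see the effect of this proposal is sketched below. proposedMin and proposedMax are hypothetical stand-ins, not the Lucene methods: only the l == 6 case (2 and 3) comes from this message, and the thresholds for every other length are assumptions chosen for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class ProposedGramSizesSketch {
    // Hypothetical variant of getMin: l == 6 yields 2 as proposed above.
    // The values for other lengths are assumptions.
    static int proposedMin(int l) {
        if (l > 6) return 3;
        if (l >= 5) return 2;
        return 1;
    }

    // Hypothetical variant of getMax: l == 6 yields 3 as proposed above.
    // The values for other lengths are assumptions.
    static int proposedMax(int l) {
        if (l > 6) return 4;
        if (l >= 5) return 3;
        return 2;
    }

    // All grams of the word for the sizes chosen by the proposed methods.
    static Set<String> grams(String word) {
        Set<String> out = new LinkedHashSet<>();
        int min = proposedMin(word.length());
        int max = proposedMax(word.length());
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                out.add(word.substring(i, i + n));
            }
        }
        return out;
    }

    // Grams that the misspelled query has in common with the indexed word.
    static Set<String> sharedGrams(String indexed, String query) {
        Set<String> common = grams(indexed);
        common.retainAll(grams(query));
        return common;
    }

    public static void main(String[] args) {
        System.out.println(sharedGrams("george", "geroge"));
        System.out.println(sharedGrams("anthem", "anhtem"));
    }
}
```

With sizes 2 and 3, "george" and "geroge" share the 2-gram "ge", and "anthem" and "anhtem" share "an" and "em", so the query would at least retrieve the correct word as a candidate.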

Any thoughts about this?

-- 
Regards,
Samuel García
