lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Rennie" <jren...@gmail.com>
Subject spellcheck: issues
Date Tue, 07 Oct 2008 14:56:12 GMT
Hello, I've been exploring usage of the spellcheck feature via solr 1.3.  I
have it working, but there are some issues I'm seeing that make it less
useful than it could be.  Response on the solr-user mailing list has been
limited.  I'm guessing the reason may be that I'm asking about issues which
are most relevant to the lucene codebase.  So, I hope you don't mind this
cross-posting.

I've noticed a few issues with spellcheck as I've been testing it out for
use on our site...

   1. Rebuild breaks requests - I'm using rebuildOnCommit ATM.  If a commit
   is going on and files are being rebuilt in the spellcheck data dir,
   spellcheck requests yield bogus answers.  I.e. I can issue identical
   requests and get drastically different answers.  The first time, I get
   suggestions and "correctlySpelled" is false.  The second time (during the
   commit), I get no suggestions and "correctlySpelled" is true.  Shouldn't
   spellcheck use the old index until the new one is ready for use, like solr
   does with optimizes?
   2. Inconsistent ordering - The first suggestion changes depending on the
   spellcheck.count that I specify.  If my query is "chanl" and I ask for one
   result, the suggestion is "chant" (freq. 16).  If I ask for 5 results, the
   first suggestion is also "chant"; the other 4 suggestions are less frequent
   (#2 is "chang", freq. 11).  However, if I ask for 10 results, the first
   suggestion is "chanel" (freq. 1296); #2 and #3 are "chant" and "chang"; #9
   is "chan" (freq. 174).  Shouldn't spellcheck always return the best
   suggestion first?  In my case, shouldn't "chanel" always top "chant" and
   "chang" since they all have the same edit distance yet "chanel" is two
   orders of mangnitude more popular?

Is there anything I could be doing wrong to create these problems?  If not,
are these known issues?  If not, should I create jira's for them?

Thanks,

Jason

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message