lucene-dev mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: Bug in LuceneDictionary?
Date Wed, 15 Nov 2006 02:24:07 GMT
Thanks for investigating this Ryan!
Could you open a JIRA bug and maybe provide a patch? (and a testcase
reproducing the problem would be great too).

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

On 11/14/06, Ryan Heinen <ryan.heinen@elasticpath.com> wrote:
> Hello,
>
> I believe that I may have discovered a bug in the spellchecker contrib,
> specifically the LuceneDictionary (or SpellChecker, depending on how you
> look at it) class.
>
> I noticed while doing some testing in my own code that when I was
> running the indexDictionary method of the SpellChecker class it was
> always missing the first term (alphabetically) of the field that I
> specified.
>
> I did some investigating, and believe that I have determined the cause
> of the issue.
>
> When its getWordsIterator() method is invoked, LuceneDictionary
> instantiates a TermEnum by calling terms(new Term(field, "")) on the
> IndexReader that it is provided (field = the name of the field supplied
> to the LuceneDictionary).
>
> The LuceneDictionary.hasNext() method calls termEnum.next() to determine
> whether or not there are more terms left in the TermEnum.
>
> Unfortunately, because terms(Term) returns a TermEnum that is already
> positioned on the first term at or after the supplied term, the first
> term of the field is already the current term of the TermEnum. Thus,
> because LuceneDictionary.hasNext() calls TermEnum.next() regardless of
> whether or not the first term has been read, loops that use the
> following structure, as the SpellChecker does, do not have the expected
> results:
>
> while (iterator.hasNext()) {
>         // obtain and do something with iterator.next();
> }
>
> With data "abc", "def", "ghi", "jkl" in the specified index & field, the
> loop will only execute 3 times, with "def", "ghi", "jkl" being the only
> values retrieved. One would expect that the loop should execute 4 times,
> with all four values ("abc", "def", "ghi", "jkl") showing up in the loop.
>
> Has anyone encountered this problem before? Am I missing something, or
> should I report this as a bug?
>
> As far as I see it, the LuceneIterator should not be calling the next()
> method of its underlying TermEnum unless the next() method of the
> LuceneIterator class is called.
>
> Any advice would be appreciated. I've appended some code below.
>
> Thanks,
>
> Ryan
>
> --------
>
> Here are a few lines from SpellChecker.java showing how it uses
> LuceneDictionary's iterator:
>
> Iterator iter=dict.getWordsIterator();
> while (iter.hasNext()) {
>        String word=(String) iter.next();
>        ...
> }
>
> Below are the next() and hasNext() methods from LuceneDictionary.java
>
> public Object next() {
>        if (!has_next_called) {
>          hasNext();
>        }
>        has_next_called = false;
>        return (actualTerm != null) ? actualTerm.text() : null;
>      }
>
>
>      public boolean hasNext() {
>        has_next_called = true;
>        try {
>          // if there is still words
>          if (!termEnum.next()) {
>            actualTerm = null;
>            return false;
>          }
>          //  if the next word are in the field
>          actualTerm = termEnum.term();
>          String fieldt = actualTerm.field();
>          if (fieldt != field) {
>            actualTerm = null;
>            return false;
>          }
>          return true;
>        } catch (IOException ex) {
>          ex.printStackTrace();
>          return false;
>        }
>      }
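
The behaviour described above can be reproduced without Lucene. What
follows is a minimal sketch under stated assumptions, not the contrib
code: FakeTermEnum is a hypothetical stand-in for TermEnum that starts
out positioned on its first term, BuggyIterator mirrors the quoted
hasNext()/next() logic (the field check is left out for brevity), and
FixedIterator is one possible way to read the current term before
advancing. All class names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class TermIteratorSketch {

    /** Hypothetical stand-in for TermEnum: already positioned on the first term. */
    static class FakeTermEnum {
        private final String[] terms;
        private int pos = 0; // the first term is current before any next() call

        FakeTermEnum(String... terms) { this.terms = terms; }

        /** Current term, or null once the enumeration is exhausted. */
        String term() { return pos < terms.length ? terms[pos] : null; }

        /** Advances to the next term; returns false if there is none. */
        boolean next() { pos++; return pos < terms.length; }
    }

    /** Mirrors the quoted logic: hasNext() always advances before reading. */
    static class BuggyIterator implements Iterator<String> {
        private final FakeTermEnum termEnum;
        private String actualTerm;
        private boolean hasNextCalled;

        BuggyIterator(FakeTermEnum termEnum) { this.termEnum = termEnum; }

        public boolean hasNext() {
            hasNextCalled = true;
            if (!termEnum.next()) { // skips the term that was already current
                actualTerm = null;
                return false;
            }
            actualTerm = termEnum.term();
            return true;
        }

        public String next() {
            if (!hasNextCalled) hasNext();
            hasNextCalled = false;
            return actualTerm;
        }
    }

    /** One possible fix (an assumption, not the committed patch). */
    static class FixedIterator implements Iterator<String> {
        private final FakeTermEnum termEnum;
        private boolean consumed = false; // has the current term been returned?

        FixedIterator(FakeTermEnum termEnum) { this.termEnum = termEnum; }

        public boolean hasNext() {
            if (consumed) {          // advance only after the current term was used
                termEnum.next();
                consumed = false;
            }
            return termEnum.term() != null;
        }

        public String next() {
            if (!hasNext()) throw new NoSuchElementException();
            consumed = true;
            return termEnum.term();
        }
    }

    /** Drains an iterator with the same while-loop shape SpellChecker uses. */
    static List<String> collect(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(new BuggyIterator(new FakeTermEnum("abc", "def", "ghi", "jkl"))));
        // prints [def, ghi, jkl]
        System.out.println(collect(new FixedIterator(new FakeTermEnum("abc", "def", "ghi", "jkl"))));
        // prints [abc, def, ghi, jkl]
    }
}
```

In the fixed variant hasNext() can be called repeatedly without
skipping terms, because the underlying enumeration only advances after
a term has actually been returned by next(). A real patch would also
need to keep the field check from the quoted code.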


