lucene-dev mailing list archives

From Ryan Heinen <ryan.hei...@elasticpath.com>
Subject Re: Bug in LuceneDictionary?
Date Mon, 20 Nov 2006 18:46:16 GMT
Yonik Seeley wrote:
> Thanks for investigating this Ryan!
> Could you open a JIRA bug and maybe provide a patch? (and a testcase
> reproducing the problem would be great too).

Will do. I've been busy the last few days, but hopefully will get around 
to it soon.

Ryan

> 
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server
> 
> On 11/14/06, Ryan Heinen <ryan.heinen@elasticpath.com> wrote:
>> Hello,
>>
>> I believe that I may have discovered a bug in the spellchecker contrib,
>> specifically the LuceneDictionary (or SpellChecker, depending on how you
>> look at it) class.
>>
>> I noticed while doing some testing in my own code that when I was
>> running the indexDictionary method of the SpellChecker class it was
>> always missing the first term (alphabetically) of the field that I
>> specified.
>>
>> I did some investigating, and believe that I have determined the cause
>> of the issue.
>>
>> When its getWordsIterator() method is invoked, LuceneDictionary
>> instantiates a TermEnum by calling terms(new Term(field, "")) on the
>> IndexReader that it is provided. (field = the name of the field supplied
>> to the LuceneDictionary)
>>
>> The LuceneDictionary.hasNext() method calls termEnum.next() to determine
>> whether or not there are more terms left in the TermEnum.
>>
>> Unfortunately, because terms() returns a TermEnum that is already
>> positioned on the first term at or after the supplied term, that term
>> is the current term before next() is ever called. Thus, because
>> LuceneDictionary.hasNext() calls TermEnum.next() regardless of whether
>> or not the first term has been read, loops that use the following
>> structure, as the SpellChecker does, do not have the expected results:
>>
>> while (iterator.hasNext()) {
>>         // obtain and do something with iterator.next();
>> }
>>
>> With data "abc", "def", "ghi", "jkl" in the specified index & field, the
>> loop will only execute 3 times, with "def", "ghi", "jkl" being the only
>> values retrieved. One would expect that the loop should execute 4 times,
>> with all four values ("abc", "def", "ghi", "jkl") showing up in the loop.
>>
>> Has anyone encountered this problem before? Am I missing something, or
>> should I report this as a bug?
>>
>> As far as I see it, the LuceneIterator should not be calling the next()
>> method of its underlying TermEnum unless the next() method of the
>> LuceneIterator class is called.
>>
>> Any advice would be appreciated. I've appended some code below.
>>
>> Thanks,
>>
>> Ryan
>>
>> --------
>>
>> Here are a few lines from SpellChecker.java showing how it uses
>> LuceneDictionary's iterator:
>>
>> Iterator iter=dict.getWordsIterator();
>> while (iter.hasNext()) {
>>        String word=(String) iter.next();
>>        ...
>> }
>>
>> Below are the next() and hasNext() methods from LuceneDictionary.java
>>
>> public Object next() {
>>        if (!has_next_called) {
>>          hasNext();
>>        }
>>        has_next_called = false;
>>        return (actualTerm != null) ? actualTerm.text() : null;
>>      }
>>
>>
>>      public boolean hasNext() {
>>        has_next_called = true;
>>        try {
>>          // if there is still words
>>          if (!termEnum.next()) {
>>            actualTerm = null;
>>            return false;
>>          }
>>          //  if the next word are in the field
>>          actualTerm = termEnum.term();
>>          String fieldt = actualTerm.field();
>>          if (fieldt != field) {
>>            actualTerm = null;
>>            return false;
>>          }
>>          return true;
>>        } catch (IOException ex) {
>>          ex.printStackTrace();
>>          return false;
>>        }
>>      }
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
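
The skipped first term described in the quoted report can be reproduced
without Lucene. The following is a minimal sketch, not the actual contrib
code: PositionedCursor is a hypothetical stand-in for a TermEnum that starts
out positioned on its first term, SkippingIterator mirrors the
hasNext()/next() pattern quoted above (the field check is left out), and
PeekingIterator is one possible fix that advances the cursor only from
next(). All of these class names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a TermEnum: it starts out already positioned on its first term. */
class PositionedCursor {
    private final List<String> terms;
    private int pos = 0; // already on the first term before next() is ever called

    PositionedCursor(List<String> terms) { this.terms = terms; }

    /** Current term, or null once the cursor has run past the last term. */
    String term() { return pos < terms.size() ? terms.get(pos) : null; }

    /** Advances to the next term; returns false when no term is left. */
    boolean next() { pos++; return pos < terms.size(); }
}

/** Mirrors the reported pattern: hasNext() always advances before anything is read. */
class SkippingIterator implements Iterator<String> {
    private final PositionedCursor cursor;
    private String current;
    private boolean hasNextCalled = false;

    SkippingIterator(PositionedCursor cursor) { this.cursor = cursor; }

    @Override
    public boolean hasNext() {
        hasNextCalled = true;
        if (!cursor.next()) { // moves past the term the cursor started on
            current = null;
            return false;
        }
        current = cursor.term();
        return true;
    }

    @Override
    public String next() {
        if (!hasNextCalled) {
            hasNext();
        }
        hasNextCalled = false;
        return current;
    }
}

/** One possible fix: hasNext() only inspects the cursor; next() reads, then advances. */
class PeekingIterator implements Iterator<String> {
    private final PositionedCursor cursor;

    PeekingIterator(PositionedCursor cursor) { this.cursor = cursor; }

    @Override
    public boolean hasNext() { return cursor.term() != null; }

    @Override
    public String next() {
        String term = cursor.term();
        if (term == null) {
            throw new NoSuchElementException();
        }
        cursor.next(); // advance only after the current term has been handed out
        return term;
    }
}

class IteratorSkipDemo {
    /** Collects everything the iterator yields using the while (hasNext()) loop shape. */
    static List<String> drain(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("abc", "def", "ghi", "jkl");
        System.out.println(drain(new SkippingIterator(new PositionedCursor(terms))));
        System.out.println(drain(new PeekingIterator(new PositionedCursor(terms))));
    }
}
```

With the terms "abc", "def", "ghi", "jkl", the first iterator yields only the
last three and the second yields all four. A side effect of the peeking
variant is that calling hasNext() several times in a row no longer consumes
terms. Any real patch would also need to keep the field check and the
IOException handling from the original methods.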


-- 
Ryan Heinen - Software Engineer
Elastic Path Software, Inc.

Phone   604 408 8078 ext 243
Fax     604 408 8079
E-mail  ryan.heinen@elasticpath.com
Web     http://www.elasticpath.com



