Yonik Seeley wrote:
> Thanks for investigating this Ryan!
> Could you open a JIRA bug and maybe provide a patch? (and a testcase
> reproducing the problem would be great too).
Will do. I've been busy the last few days, but hopefully will get around
to it soon.
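
In the meantime, here is a rough sketch of the direction I have in mind
for the patch: read the term the enum is already positioned on before
advancing it, and keep hasNext() free of side effects. Note that
SimpleTermEnum and ArrayTermEnum are hypothetical stand-ins I made up so
the example runs without an index; they are not Lucene classes, and the
field-name check from the real hasNext() is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class LookaheadSketch {

    /** Hypothetical stand-in for the two TermEnum methods that matter here. */
    interface SimpleTermEnum {
        String term();   // current term, or null when exhausted
        boolean next();  // advance; false when nothing is left
    }

    /** Array-backed enum that starts out already positioned on its first entry. */
    static final class ArrayTermEnum implements SimpleTermEnum {
        private final String[] terms;
        private int pos = 0;

        ArrayTermEnum(String... terms) { this.terms = terms; }

        public String term() { return pos < terms.length ? terms[pos] : null; }

        public boolean next() { pos++; return pos < terms.length; }
    }

    /** Iterator that consumes the current term first and only then advances. */
    static final class WordIterator implements Iterator<String> {
        private final SimpleTermEnum termEnum;
        private String pending; // term to hand out on the next call to next()

        WordIterator(SimpleTermEnum termEnum) {
            this.termEnum = termEnum;
            this.pending = termEnum.term(); // the enum is already on its first term
        }

        // No side effects, so calling hasNext() repeatedly is harmless.
        public boolean hasNext() { return pending != null; }

        public String next() {
            if (pending == null) throw new NoSuchElementException();
            String result = pending;
            pending = termEnum.next() ? termEnum.term() : null;
            return result;
        }

        public void remove() { throw new UnsupportedOperationException(); }
    }

    /** Drains the iterator with the same loop shape SpellChecker uses. */
    public static List<String> collect(String... terms) {
        List<String> words = new ArrayList<String>();
        Iterator<String> it = new WordIterator(new ArrayTermEnum(terms));
        while (it.hasNext()) {
            words.add(it.next());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(collect("abc", "def", "ghi", "jkl"));
    }
}
```

With the four example terms from my original mail, the loop now visits
all of them, including "abc". The real patch would of course work
against TermEnum directly and keep the check that the term still
belongs to the requested field.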
Ryan
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server
>
> On 11/14/06, Ryan Heinen <ryan.heinen@elasticpath.com> wrote:
>> Hello,
>>
>> I believe that I may have discovered a bug in the spellchecker contrib,
>> specifically the LuceneDictionary (or SpellChecker, depending on how you
>> look at it) class.
>>
>> I noticed while doing some testing in my own code that when I was
>> running the indexDictionary method of the SpellChecker class it was
>> always missing the first term (alphabetically) of the field that I
>> specified.
>>
>> I did some investigating, and believe that I have determined the cause
>> of the issue.
>>
>> When its getWordsIterator() method is invoked, LuceneDictionary
>> instantiates a TermEnum by calling terms(new Term(field, "")) on the
>> IndexReader that it is provided. (field = the name of the field supplied
>> to the LuceneDictionary)
>>
>> The LuceneDictionary.hasNext() method calls termEnum.next() to determine
>> whether or not there are more terms left in the TermEnum.
>>
>> Unfortunately, because terms() returns a TermEnum that is already
>> positioned on the first term at or after the supplied term, that term
>> is the current term of the TermEnum before next() is ever called. Thus,
>> because LuceneDictionary.hasNext() calls TermEnum.next() regardless of
>> whether or not the first term has been read, loops that use the
>> following structure, as the SpellChecker does, do not have the expected
>> results:
>>
>> while (iterator.hasNext()) {
>> // obtain and do something with iterator.next();
>> }
>>
>> With data "abc", "def", "ghi", "jkl" in the specified index & field, the
>> loop will only execute 3 times, with "def", "ghi", "jkl" being the only
>> values retrieved. One would expect that the loop should execute 4 times,
>> with all four values ("abc", "def", "ghi", "jkl") showing up in the loop.
>>
>> Has anyone encountered this problem before? Am I missing something, or
>> should I report this as a bug?
>>
>> As far as I see it, the LuceneIterator should not be calling the next()
>> method of its underlying TermEnum unless the next() method of the
>> LuceneIterator class is called.
>>
>> Any advice would be appreciated. I've appended some code below.
>>
>> Thanks,
>>
>> Ryan
>>
>> --------
>>
>> Here are a few lines from SpellChecker.java showing how it uses
>> LuceneDictionary's iterator:
>>
>> Iterator iter=dict.getWordsIterator();
>> while (iter.hasNext()) {
>> String word=(String) iter.next();
>> ...
>> }
>>
>> Below are the next() and hasNext() methods from LuceneDictionary.java
>>
>> public Object next() {
>> if (!has_next_called) {
>> hasNext();
>> }
>> has_next_called = false;
>> return (actualTerm != null) ? actualTerm.text() : null;
>> }
>>
>>
>> public boolean hasNext() {
>> has_next_called = true;
>> try {
>> // if there is still words
>> if (!termEnum.next()) {
>> actualTerm = null;
>> return false;
>> }
>> // if the next word are in the field
>> actualTerm = termEnum.term();
>> String fieldt = actualTerm.field();
>> if (fieldt != field) {
>> actualTerm = null;
>> return false;
>> }
>> return true;
>> } catch (IOException ex) {
>> ex.printStackTrace();
>> return false;
>> }
>> }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
--
Ryan Heinen - Software Engineer
Elastic Path Software, Inc.
Phone 604 408 8078 ext 243
Fax 604 408 8079
E-mail ryan.heinen@elasticpath.com
Web http://www.elasticpath.com