lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 21189] - Hits.length() returns to large value
Date Mon, 30 Jun 2003 16:35:40 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=21189>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=21189

Hits.length() returns to large value





------- Additional Comments From daniel.armbrust@mayo.edu  2003-06-30 16:35 -------
I did some more debugging, and it appears that this is indeed a Lucene error.
I didn't realize the id method was new in 1.3. The getMoreDocs method in the
Hits class must not have been used prior to the introduction of id, because
getMoreDocs is broken.

When you call getMoreDocs, it never actually adds the new documents to the
hitDocs vector, because the for loop at the end has the wrong start point.


  private final void getMoreDocs(int min) throws IOException {
    if (hitDocs.size() > min)
      min = hitDocs.size();

    int n = min * 2;  // double # retrieved
    TopDocs topDocs = searcher.search(query, filter, n);
    length = topDocs.totalHits;
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;

    float scoreNorm = 1.0f;
    if (length > 0 && scoreDocs[0].score > 1.0f)
      scoreNorm = 1.0f / scoreDocs[0].score;

    int end = scoreDocs.length < length ? scoreDocs.length : length;
    for (int i = hitDocs.size(); i < end; i++)
      hitDocs.addElement(new HitDoc(scoreDocs[i].score*scoreNorm,
                                    scoreDocs[i].doc));
  }


I think (my knowledge of Lucene isn't all that broad, so this may be incorrect)
that the for loop should look like this:
for (int i = 0; i < end; i++)   // start from i = 0 rather than hitDocs.size()
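
Since the start index is the point in question, one way to check it is with a
small standalone model of the loop. The sketch below is not Lucene code: the
class HitCacheModel and its search method are hypothetical stand-ins, and it
assumes that each search call returns the top n hits starting again from
rank 0, which is how the searcher.search(query, filter, n) call above reads.

```java
import java.util.ArrayList;
import java.util.List;

public class HitCacheModel {
    // Hypothetical stand-in for searcher.search(query, filter, n):
    // returns the ids of the top n hits, always starting at rank 0.
    static int[] search(int totalHits, int n) {
        int len = Math.min(n, totalHits);
        int[] docs = new int[len];
        for (int i = 0; i < len; i++) {
            docs[i] = i;
        }
        return docs;
    }

    // Same shape as getMoreDocs; startAtZero selects the loop start index.
    static void getMoreDocs(List<Integer> cache, int min, int totalHits,
                            boolean startAtZero) {
        if (cache.size() > min) {
            min = cache.size();
        }
        int[] scoreDocs = search(totalHits, min * 2);
        int end = Math.min(scoreDocs.length, totalHits);
        int start = startAtZero ? 0 : cache.size();
        for (int i = start; i < end; i++) {
            cache.add(scoreDocs[i]);
        }
    }

    // Calls getMoreDocs twice on an empty cache and reports its final size.
    static int cachedAfterTwoCalls(boolean startAtZero) {
        List<Integer> cache = new ArrayList<>();
        getMoreDocs(cache, 50, 1000, startAtZero);
        getMoreDocs(cache, 50, 1000, startAtZero);
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println("start at cache size: " + cachedAfterTwoCalls(false));
        System.out.println("start at zero:       " + cachedAfterTwoCalls(true));
    }
}
```

Running main prints the cache size after two calls for each start index. Under
the stated assumption, a size larger than the number of hits requested by the
second call would mean that some ranks were appended twice.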



But I also just noticed a new problem: the scoring in this method will be
incorrect.  If any of the first 100 documents retrieved score above 1.0, they
will be normalized down by X.  Then, when the next set of documents is
retrieved, the first one will most likely not score above 1.0, so these
documents will not have their scores normalized.  Now some of the documents
have had their scores normalized, and some have not.

My guess is that the normalization factor should be stored in the class
somewhere, to be used for all subsequent calls.
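
A minimal sketch of that idea, outside of Lucene: the class StoredNorm and its
normalize method are hypothetical names, and the sketch assumes the factor
should be fixed from the top score of the first batch and then reused as-is.

```java
public class StoredNorm {
    // Hypothetical field: the factor lives on the instance instead of being
    // recomputed on every call. A negative value means "not yet computed".
    private float scoreNorm = -1.0f;

    // Normalizes one batch of raw scores, fixing the factor on first use.
    float[] normalize(float[] rawScores) {
        if (scoreNorm < 0.0f) {
            scoreNorm = (rawScores.length > 0 && rawScores[0] > 1.0f)
                    ? 1.0f / rawScores[0]
                    : 1.0f;
        }
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] * scoreNorm;
        }
        return out;
    }
}
```

With this shape, a later batch whose first score is below 1.0 is still scaled
by the factor taken from the first batch, so all cached scores share one scale.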

Maybe I should open a new bug report, or maybe a committer wants to kill two
birds with one stone.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

