lucene-dev mailing list archives

From "Chuck Williams" <ch...@manawiz.com>
Subject RE: Normalized Scoring -- was RE: idf and explain(), was Re: Search and Scoring
Date Thu, 21 Oct 2004 19:50:14 GMT
Daniel,

I haven't yet dealt with multiple indices, but will in the not-too-distant future, so this
sounds like a problem that will also be important to me.  I just briefly read through the
relevant code (e.g., MultiSearcher) to try to understand the issue.  My guess is the problem
arises from the fact that the separate indices have separately computed their tf's and idf's.
This would imply that the searches against each index are completely separate searches.
Since the current scoring does not produce scores that are comparable across separate searches,
the re-sorting of the hits in MultiSearcher.search() via the HitQueue would not accomplish
its intended effect.  This would lead to an incorrect final ranking.  Is that the problem
you are actually seeing?  If I've got it right, then yes, I believe what I'm proposing will
fix this too since it would make the scores coming back from the searches against the separate
indices directly comparable, causing the interleaving in MultiSearcher.search() to work properly.
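
A minimal sketch of why per-index statistics make scores incomparable, assuming the classic
idf formula log(numDocs / (docFreq + 1)) + 1. The class name, method name, and term statistics
below are hypothetical and chosen only for illustration:

```java
public class PerIndexIdf {
    // Classic Lucene-style idf, assumed here for illustration:
    // idf = log(numDocs / (docFreq + 1)) + 1
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical statistics for the same term in two separate indices.
        double idfA = idf(1, 1000);    // term is rare in index A
        double idfB = idf(500, 1000);  // term is common in index B

        // Two otherwise identical documents (same tf, same length) would get
        // different scores depending only on which index holds them.
        System.out.println("idf in index A = " + idfA);
        System.out.println("idf in index B = " + idfB);
        System.out.println("A weights the term more heavily: " + (idfA > idfB));
    }
}
```

Merging the two hit lists by raw score would then favor hits from index A for this term,
regardless of how relevant the individual documents are.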

However, I'm not sure this analysis is completely correct due to MultiSearcher.docFreq(), which
appears to be trying to redefine the document frequencies (and hence the idf's) to be the global
value across all indices.  It wasn't clear to me how this code is ever reached, e.g. from
TermQuery --> SegmentTermDocs.  If the tf's and idf's are in fact computed globally, then the
interleaving should work as it is, thus I'm guessing they are not.

This raises the question of the desired semantics.  Computing the tf's and idf's globally
seems right for apps that use multiple indices strictly for scalability reasons, while issuing
separate searches with properly-comparable but separate scoring on each seems right for meta-search.
If the scalability case isn't working right (i.e., if MultiSearcher is not computing the tf's
and idf's across the entire collection of indices), fixing it would require a different approach
than what I've proposed.
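
For the scalability case, a rough sketch of what "global" statistics could look like: sum the
document frequencies and document counts over all indices before computing idf, so every index
scores the term with the same weight. This assumes the same classic idf formula as above; the
class, method names, and numbers are hypothetical and are not taken from the Lucene source:

```java
public class GlobalIdf {
    // Classic Lucene-style idf, assumed here for illustration.
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Hypothetical helper: combine per-index statistics into one collection-wide idf.
    static double globalIdf(int[] docFreqs, int[] numDocs) {
        int totalDocFreq = 0;
        int totalDocs = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            totalDocFreq += docFreqs[i];
            totalDocs += numDocs[i];
        }
        return idf(totalDocFreq, totalDocs);
    }

    public static void main(String[] args) {
        int[] docFreqs = {1, 500};     // made-up per-index document frequencies
        int[] numDocs = {1000, 1000};  // made-up per-index document counts
        // One shared idf replaces the two differing per-index values.
        System.out.println("global idf = " + globalIdf(docFreqs, numDocs));
    }
}
```

With a shared idf, scores from the separate indices are weighted identically for the term, which
is the precondition for the interleaving by score to be meaningful.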

If I've missed the actual problem entirely, please let me know.

Thanks,

Chuck

  > -----Original Message-----
  > From: Daniel Naber [mailto:daniel.naber@t-online.de]
  > Sent: Thursday, October 21, 2004 11:33 AM
  > To: Lucene Developers List
  > Subject: Re: Normalized Scoring -- was RE: idf and explain(), was Re:
  > Search and Scoring
  > 
  > On Thursday 21 October 2004 20:00, Chuck Williams wrote:
  > 
  > > Thanks Otis.  Other than trying to get some consensus a) that this is a
  > > problem worth fixing, and b) on the best approach to fix it, my central
  > > question is, if I fix it is it likely to get incorporated back into
  > > Lucene?
  > 
  > Chuck,
  > 
  > sorry, I also lack the time and knowledge to follow this discussion, but
  > what I consider a problem is that you currently cannot search over several
  > indices without getting an incorrect ranking (except these indices were
  > built from splitting one large index). Is that also something you're
  > trying to solve?
  > 
  > Regards
  >  Daniel
  > 
  > --
  > http://www.danielnaber.de
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

