lucene-dev mailing list archives

From Christoph Goller <gol...@detego-software.de>
Subject Re: idf and explain(), was Re: Search and Scoring
Date Sat, 23 Oct 2004 16:18:54 GMT
Chuck Williams wrote:
> Christoph, thanks for reading through my long postings and sharing your
> thoughts.  I had one comment in the first proposal email stating a
> conclusion to move away from cosine normalization, but I didn't share
> the reasons for this conclusion.  Please let me know if you agree with
> the following analysis.
> 
> I believe the central issue is the term sum(t) weight(t,d)^2, as Doug
> pointed out.  There seem to be two possible definitions for this term:
>   a) The sum extends over all the terms in the document
>   b) The sum just extends over the terms in the query

Damn, I was a bit sloppy. Cosine normalization would of course require
the sum over all terms of a document (a), and Doug is right that this
probably cannot be computed efficiently.
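
To make the difference between (a) and (b) concrete, here is a minimal
sketch. The terms and weights are made up for illustration and do not come
from any real index; the function names are hypothetical.

```python
import math

# Hypothetical weight(t, d) values for one document; illustrative only.
doc_weights = {"lucene": 3.0, "search": 4.0, "index": 12.0}
query_terms = ["lucene", "search"]


def norm_all_terms(weights):
    """Definition (a): sqrt of sum(t) weight(t,d)^2 over every term in the document."""
    return math.sqrt(sum(w * w for w in weights.values()))


def norm_query_terms(weights, terms):
    """Definition (b): the same sum, restricted to the terms of the query."""
    return math.sqrt(sum(weights.get(t, 0.0) ** 2 for t in terms))


norm_a = norm_all_terms(doc_weights)                 # reads every term weight of the document
norm_b = norm_query_terms(doc_weights, query_terms)  # reads only the query's term weights
```

Only (a) is the document's vector length as cosine normalization needs it,
and it depends on the weights of terms the query never touches.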

> So, cosine normalization looks like a loser to me.  I'm not an expert in
> this and may have the wrong analysis here.  Do you see flaws in the
> above?

No

> I continue to believe this is an important problem and am very
> appreciative that some others are digging into the issue.  My specific
> proposal has the benefit of not changing the score relationships
> relative to Lucene today and so is good from a backward-compatibility
> standpoint.  It is clearly better than the current normalization in
> Hits.  I think that setting the top score to its (net boost) / (total
> boost) is not too bad, although as indicated in the proposal this could
> be further refined in an attempt to also use other factors (tf, idf
> and/or length norm) in the setting of the top score. I'm not sure
> whether or not using these additional factors in the normalization
> would be a good thing and would appreciate other thoughts.  (Remember
> that all factors will be used in the scoring -- the only question is
> which are important in setting the normalized top score.)
> 
> I don't see any way to address this issue through subclassing -- fixing
> it seems to require modifying Lucene source.  I'd rather not diverge
> from Lucene source, especially in so many fundamental classes, and so
> would like to see the changes incorporated back into Lucene.  Is that
> likely if I make the changes?

As far as the current normalization is concerned, I think you can "switch
it off" by using your own similarity implementation: E.g. make queryNorm and
coord return 1.0. I hope it's that simple :-)
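
A toy model of the idea, not Lucene's actual API: the classes and the
score() helper below are hypothetical stand-ins that keep only the two
factors in question. I am assuming the default behaviour is
1/sqrt(sumOfSquaredWeights) for queryNorm and overlap/maxOverlap for coord.

```python
import math


class DefaultLikeSimilarity:
    """Toy stand-in (assumption): models only queryNorm and coord."""

    def query_norm(self, sum_of_squared_weights):
        return 1.0 / math.sqrt(sum_of_squared_weights)

    def coord(self, overlap, max_overlap):
        return overlap / max_overlap


class NeutralSimilarity(DefaultLikeSimilarity):
    """Switches both normalization factors off by returning 1.0."""

    def query_norm(self, sum_of_squared_weights):
        return 1.0

    def coord(self, overlap, max_overlap):
        return 1.0


def score(sim, term_scores, sum_of_squared_weights, max_overlap):
    """Hypothetical helper: coord * queryNorm * sum of per-term scores."""
    return (sim.coord(len(term_scores), max_overlap)
            * sim.query_norm(sum_of_squared_weights)
            * sum(term_scores))


raw = score(NeutralSimilarity(), [0.5, 0.25], 4.0, 3)          # just the raw sum
normalized = score(DefaultLikeSimilarity(), [0.5, 0.25], 4.0, 3)  # both factors applied
```

With both factors fixed at 1.0, the score is just the raw sum of the term
scores, which leaves room for a different normalization on top.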

So you should be able to implement your new normalization just by changing
the scorers and IndexSearcher.

I don't think that the changes to the scorers are so big. As far as I
understand, you just add a new method for computing your netCoord. So even
if your new scoring/normalization does not find its way into Lucene, maybe
the changes to the scorers could.

Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

