lucene-java-user mailing list archives

From Patrick Diviacco <>
Subject Re: comparing lucene scores across queries
Date Tue, 29 Mar 2011 08:31:26 GMT
hey Hoss,

thanks for your reply. I thought I had solved the issue: according to Uwe, the
queries without the coord function were reasonably comparable, but now you
have actually reopened it.

So, I need to be sure I'm making them comparable, and I would like to ask the
following questions.

My BooleanQueries have a similar structure. Important: they only contain
TermQueries. There are always 3 fields, but the number of terms can vary.
This is an example of a BooleanQuery (sorry for the syntax):

field1:term1, SHOULD
field1:term2, SHOULD
field2:term1, SHOULD
field2:term2, SHOULD
field2:term3, SHOULD
field3:term1, SHOULD

If the structure of the BooleanQueries is not clear, I can print some of them
for you. They have the same number of fields but different numbers of terms.

1- Do you still think QueryNorm is not an issue? I ask because in the
documentation I can read:
QueryNorm(q) is a normalizing factor used to make scores between queries
comparable. This factor does not affect document ranking (since all ranked
documents are multiplied by the same factor), but rather just attempts to
make scores from different queries (or even different indexes) comparable.

From the documentation, it seems I can compare scores across queries.
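
For reference, the quoted factor can be sketched in a few lines. This is a
minimal standalone sketch, not Lucene code: it assumes the classic TF-IDF
definitions from the Similarity documentation (idf = 1 + ln(numDocs /
(docFreq + 1)) and queryNorm = 1 / sqrt(sumOfSquaredWeights)) and a query
boost of 1. The class name, method names, and document frequencies are made
up for illustration.

```java
// Standalone sketch of the classic queryNorm computation (hypothetical names).
final class QueryNormSketch {

    // Assumed default idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // queryNorm = 1 / sqrt(sum over terms of (idf * termBoost)^2),
    // assuming the boost of the enclosing query is 1.
    static double queryNorm(double[] idfs, double[] boosts) {
        double sumOfSquaredWeights = 0.0;
        for (int i = 0; i < idfs.length; i++) {
            double weight = idfs[i] * boosts[i];
            sumOfSquaredWeights += weight * weight;
        }
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        int numDocs = 1000; // made-up index size
        // Two queries over the same index with different numbers of terms.
        double[] twoTerms = { idf(9, numDocs), idf(99, numDocs) };
        double[] threeTerms = { idf(9, numDocs), idf(99, numDocs), idf(499, numDocs) };
        System.out.println(queryNorm(twoTerms, new double[] { 1.0, 1.0 }));
        System.out.println(queryNorm(threeTerms, new double[] { 1.0, 1.0, 1.0 }));
    }
}
```

Under these assumptions the normalized query weights (idf * boost *
queryNorm) always form a unit-length vector, whatever the number of terms.
The factor depends only on the query, so it does not bound anything on the
document side (tf, field norm).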

2- I don't think I'm using query boosts. Are they enabled by default in a
BooleanQuery?

3- FieldNorm is not mentioned in the Similarity class. How can I disable it?
Should I disable it? Is it an issue?
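
To make the question concrete, here is a minimal standalone sketch of how a
length-based field norm changes a single term's contribution. It is not
Lucene code. It assumes the classic defaults as I understand them from the
documentation (tf = sqrt(freq), lengthNorm = 1 / sqrt(numTerms), per-term
contribution tf * idf^2 * norm) and leaves out coord, queryNorm, and boosts.
All names and numbers are hypothetical.

```java
// Standalone sketch of the field-norm effect on one term's score contribution.
final class LengthNormSketch {

    // Assumed default length normalization: 1 / sqrt(number of terms in the field).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Assumed default term-frequency factor: sqrt(freq).
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Contribution of one matching term: tf * idf^2 * norm.
    // With useNorms == false the norm is treated as 1, as if norms were omitted.
    static double termScore(int freq, double idf, int fieldLength, boolean useNorms) {
        double norm = useNorms ? lengthNorm(fieldLength) : 1.0;
        return tf(freq) * idf * idf * norm;
    }

    public static void main(String[] args) {
        double idf = 2.0; // made-up idf value
        // Same term frequency, different field lengths.
        System.out.println(termScore(1, idf, 4, true));
        System.out.println(termScore(1, idf, 100, true));
        // Norms left out: field length no longer matters.
        System.out.println(termScore(1, idf, 4, false));
        System.out.println(termScore(1, idf, 100, false));
    }
}
```

In Lucene releases of this period norms could, as far as I know, be omitted
per field at index time (for example with Field.Index.ANALYZED_NO_NORMS);
please check the documentation for the version in use.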

4- If I'm not wrong, Uwe told me I can compute comparable cosine
similarities even with documents of different lengths. Tf and idf are
unbounded, and my docs have different lengths. Can't I measure the similarity
between query and doc vectors anyway?
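
As a point of comparison, a textbook cosine similarity divides by the
Euclidean norm of both vectors, so for non-negative weights it stays in
[0, 1] regardless of document length. The sketch below is standalone and
hypothetical (names and weights are made up); it is not how Lucene scores.
As I read the documentation, Lucene's practical scoring uses a length-based
field norm in place of the document vector's Euclidean norm, so raw scores
are not true cosines.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of cosine similarity over sparse term-weight vectors.
final class CosineSketch {

    // Dot product over the terms the two vectors share.
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // Euclidean length of a vector.
    static double norm(Map<String, Double> v) {
        return Math.sqrt(dot(v, v));
    }

    // Cosine of the angle between the vectors; 0 if either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double denom = norm(a) * norm(b);
        return denom == 0.0 ? 0.0 : dot(a, b) / denom;
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>();
        query.put("field1:term1", 1.0);
        query.put("field2:term1", 2.0);

        // A "long" document: same direction as the query, ten times the weights.
        Map<String, Double> longDoc = new HashMap<>();
        longDoc.put("field1:term1", 10.0);
        longDoc.put("field2:term1", 20.0);

        System.out.println(cosine(query, query));
        System.out.println(cosine(query, longDoc));
    }
}
```

Scaling a document vector does not change its cosine with the query, which is
why the cosine is comparable across document lengths while an unnormalized
tf-idf sum is not.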

5- Again, I've been told I can compare queries, and from the documentation I
can see that the queryNorm factor normalizes all queries. But you are saying
I should manually normalize them somehow? That is not clear to me.


> querynorm shouldn't be a problem (since your booleanqueries all have the
> same structure, and don't use query boosts ... i assume) but field norm
> might be; i also don't see anything mentioned so far in this thread that
> describes how you'll work around the tf and idf values being theoretically
> unbounded (unless your docs are all of identical length)
>
> ultimately, attempts at comparing scores across different searches all
> come down to normalizing (either explicitly or implicitly) and normalizing
> requires that you have a "max possible score" you can normalize relative
> to -- not just a "max score for the index", but a max score in the scope
> of all theoretical documents (because otherwise the comparison isn't fair
> given an arbitrary corpus)
>
> with the default similarity, you can't really define a "max possible
> score" for a given query because tf and idf are not bounded functions.
>
> There have been a few nice discussions about this general concept over the
> years, here's the first one i found doing a quick search...
> -Hoss