lucene-java-user mailing list archives

From danield <>
Subject Similarity formula documentation is misleading + how to make field-agnostic queries?
Date Tue, 13 Jan 2015 19:24:12 GMT
Hi all,

I have found, much to my dismay, that the documentation on Lucene’s default
similarity formula is very dangerously misleading. See it here:

Term Frequency (TF) counts are expected to be per-document in the IR
literature, and this documentation doesn’t say any differently. However, it
turns out that for Lucene, TF scores are in fact PER-FIELD.

This furthermore applies to the /coord/ component. I realise that /coord/ is
a ratio of query terms matched over total query terms, but I believe an
effort could be made to make clear that field1:term1 and field2:term1 count
as 2 different query terms. 

As an example, for 2 documents with fields field1 and field2, where
query2="field1:term1 or field2:term1"

document1={field1:"term1 term1", field2:""}
document2={field1:"term1", field2:"term1"}

Coord(query1,document1)= 1/1 = 1
Coord(query2,document1)= 1/2 = 0.5
Coord(query1,document2)= 1/2 = 0.5
Coord(query2,document2)= 2/2 = 1
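
The query2 figures can be reproduced with a small stand-alone sketch. It does not use Lucene at all. It only counts each field:term pair as a separate query clause, which is the behaviour described above. The class, the doc() helper and the whitespace tokenisation are made up for illustration, and document2 is assumed to hold term1 once in field1 and once in field2.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CoordSketch {
    // query2 from the example: two clauses, one per field.
    static final String[][] QUERY2 = {{"field1", "term1"}, {"field2", "term1"}};

    // Hypothetical helper: split field text on whitespace; empty text yields no tokens.
    static List<String> tokens(String text) {
        return text.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(text.split("\\s+"));
    }

    // Hypothetical helper: build a two-field document.
    static Map<String, List<String>> doc(String field1Text, String field2Text) {
        Map<String, List<String>> d = new HashMap<>();
        d.put("field1", tokens(field1Text));
        d.put("field2", tokens(field2Text));
        return d;
    }

    // coord = matched clauses / total clauses, with every {field, term} pair
    // treated as its own clause.
    static double coord(String[][] query, Map<String, List<String>> d) {
        int matched = 0;
        for (String[] clause : query) {
            List<String> fieldTokens = d.getOrDefault(clause[0], Collections.<String>emptyList());
            if (fieldTokens.contains(clause[1])) {
                matched++;
            }
        }
        return (double) matched / query.length;
    }

    public static void main(String[] args) {
        System.out.println(coord(QUERY2, doc("term1 term1", ""))); // prints 0.5
        System.out.println(coord(QUERY2, doc("term1", "term1")));  // prints 1.0
    }
}
```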

Now, the TF scores will be normalized with the fieldNorm component which is
computed based on field length at indexing time and stored in a single byte,
with a significant loss of precision. These things together make it
impossible to run Lucene retrieval in such a way that 

*similarity(query2,document1) == similarity(query2,document2)*

which is precisely what I need in my use case.
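
To make the mismatch concrete, here is a rough sketch of how the per-field pieces combine. It assumes the DefaultSimilarity choices tf = sqrt(freq) and lengthNorm = 1/sqrt(number of terms in the field), and it deliberately leaves out idf, queryNorm, boosts and the one-byte norm encoding, so the numbers are not real Lucene scores. The class and helper names are hypothetical, and document2 is assumed to hold term1 once in each field.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PerFieldScoreSketch {
    static final String[][] QUERY2 = {{"field1", "term1"}, {"field2", "term1"}};

    // Hypothetical helper: build a two-field document from whitespace-separated text.
    static Map<String, List<String>> doc(String field1Text, String field2Text) {
        Map<String, List<String>> d = new HashMap<>();
        d.put("field1", tokens(field1Text));
        d.put("field2", tokens(field2Text));
        return d;
    }

    static List<String> tokens(String text) {
        return text.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(text.split("\\s+"));
    }

    // Simplified score: coord * sum over matched clauses of tf * lengthNorm,
    // where tf and lengthNorm are both computed per field, not per document.
    static double score(String[][] query, Map<String, List<String>> d) {
        int matched = 0;
        double sum = 0.0;
        for (String[] clause : query) {
            List<String> fieldTokens = d.getOrDefault(clause[0], Collections.<String>emptyList());
            int freq = Collections.frequency(fieldTokens, clause[1]);
            if (freq == 0) {
                continue; // clause does not match this document
            }
            matched++;
            double tf = Math.sqrt(freq);                          // assumed tf
            double lengthNorm = 1.0 / Math.sqrt(fieldTokens.size()); // assumed norm, unencoded
            sum += tf * lengthNorm;
        }
        double coord = (double) matched / query.length;
        return coord * sum;
    }

    public static void main(String[] args) {
        System.out.println(score(QUERY2, doc("term1 term1", ""))); // roughly 0.5
        System.out.println(score(QUERY2, doc("term1", "term1")));  // roughly 2.0
    }
}
```

Even in this stripped-down form, both documents contain term1 twice, yet document2 comes out about four times higher than document1, purely because the two occurrences sit in different fields.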

Here are my questions:
1. I think the documentation should be updated to make this clear! Can I do
this myself?
2. Has anyone encountered this problem before? Is there an easy fix?

