lucene-java-user mailing list archives

From Michael Sokolov <msoko...@safaribooksonline.com>
Subject Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
Date Thu, 15 Jan 2015 03:08:03 GMT
In practice, normalizing by field length proves more useful than 
normalizing by the sum of the lengths of all fields (document length), 
which I think is what you are after.  Think of a book-chapter document 
with two fields: title and full text.  It makes little sense to weight 
the title terms differently depending on whether the full text is long 
or short.
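
To put rough numbers on the title example, here is a minimal sketch. It 
assumes the 1/sqrt(numTerms) length norm that DefaultSimilarity applies, 
and it ignores index-time boosts and the lossy single-byte encoding of 
norms.  The class name, method name and term counts are made up for 
illustration.

```java
public class LengthNormSketch {
    // Length norm in the style of DefaultSimilarity: 1 / sqrt(number of terms).
    // Assumption: no index-time boost, no single-byte quantization.
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        int titleLen = 4;     // terms in the title field (illustrative)
        int shortBody = 96;   // terms in a short chapter (illustrative)
        int longBody = 9996;  // terms in a long chapter (illustrative)

        // Per-field normalization: the title norm depends only on the title.
        System.out.println("per-field title norm:        " + lengthNorm(titleLen));

        // Per-document normalization: the same title terms get a different
        // weight as soon as the full text grows.
        System.out.println("per-doc norm, short chapter: " + lengthNorm(titleLen + shortBody));
        System.out.println("per-doc norm, long chapter:  " + lengthNorm(titleLen + longBody));
    }
}
```

With per-field norms the title is scored the same way for both chapters; 
with a per-document norm the title match in the long chapter would be 
weighted far lower for no reason related to the title itself.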

To get the behavior (I think) you want, you could index your documents 
like this:

document1={field:"field1:term1 field1:term1"}
document2={field:"field1:term1 field2:term1"}

and form queries like:

query1="field:field1\:term1"
query2="field:(field1\:term1 OR field2\:term1)"
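
A rough sketch of how the flattened field value and the escaped query 
strings could be produced is below.  It uses only the JDK, not Lucene; 
the class and helper names are hypothetical.  It assumes the catch-all 
field is analyzed by splitting on whitespace only, so that 
"field1:term1" survives as a single token, and that the classic query 
parser is used, where boolean operators must be upper case (OR) and the 
colon inside a term must be escaped.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FieldPrefixSketch {
    // Name of the single catch-all field, as in the example above.
    static final String CATCH_ALL = "field";

    // Turn {field1: "term1 term1"} into "field1:term1 field1:term1".
    // Assumption: terms are separated by whitespace.
    public static String flatten(Map<String, String> fields) {
        StringJoiner out = new StringJoiner(" ");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String text = e.getValue().trim();
            if (text.isEmpty()) {
                continue; // an empty field contributes no tokens
            }
            for (String term : text.split("\\s+")) {
                out.add(e.getKey() + ":" + term);
            }
        }
        return out.toString();
    }

    // One prefixed term with the colon escaped for the query parser.
    public static String clause(String field, String term) {
        return field + "\\:" + term;
    }

    // A query string against the catch-all field; several clauses are OR-ed.
    public static String query(String... clauses) {
        if (clauses.length == 1) {
            return CATCH_ALL + ":" + clauses[0];
        }
        return CATCH_ALL + ":(" + String.join(" OR ", clauses) + ")";
    }

    public static void main(String[] args) {
        Map<String, String> doc1 = new LinkedHashMap<>();
        doc1.put("field1", "term1 term1");
        doc1.put("field2", "");
        System.out.println(flatten(doc1));
        System.out.println(query(clause("field1", "term1")));
        System.out.println(query(clause("field1", "term1"), clause("field2", "term1")));
    }
}
```

Since every token now lives in the same field, term frequencies and the 
length norm are computed over the whole document rather than per 
original field.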

-Mike

On 1/13/15 2:24 PM, danield wrote:
> Hi all,
>
> I have found, much to my dismay, that the documentation on Lucene’s default
> similarity formula is very dangerously misleading. See it here:
> http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#formula_tf
>
> Term Frequency (TF) counts are expected to be per-document in the IR
> literature, and this documentation doesn’t say any differently. However, it
> turns out that for Lucene, TF scores are in fact PER-FIELD.
>
> This furthermore applies to the /coord/ component. I realise that /coord/ is
> a ratio of query terms matched over total query terms, but I believe an
> effort could be made to make clear that field1:term1 and field2:term1 count
> as 2 different query terms.
>
> As an example, for 2 documents with fields field1 and field2, where
> query1="field1:term1"
> query2="field1:term1 OR field2:term1"
>
> document1={field1:"term1 term1", field2:""}
> document2={field1:"term1", field2:"term1"}
>
> Coord(query1,document1)= 1/1 = 1
> Coord(query2,document1)= 1/2 = 0.5
> Coord(query1,document2)= 1/2 = 0.5
> Coord(query2,document2)= 2/2 = 1
>
> Now, the TF scores will be normalized with the fieldNorm component which is
> computed based on field length at indexing time and stored in a single byte,
> with a significant loss of precision. These things together make it
> impossible to run Lucene retrieval in such a way that
>
> *similarity(query2,document1) == similarity(query2,document2)*
>
> which is precisely what I need in my use case.
>
> Here are my questions:
> 1. I think the documentation should be updated to make this clear! Can I do
> this myself?
> 2. Has anyone encountered this problem before? Is there an easy fix?
>
> Cheers,
> Daniel
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-queries-tp4179307.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

