lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: determination of matching hits
Date Mon, 20 Dec 2004 17:01:57 GMT
Christiaan,

Please simplify your situation.  Use a plain TermQuery for "B" and see 
what is returned.  Then use a simple BooleanQuery for "A -B".  I 
suspect MultiFieldQueryParser is the culprit.  What does the toString 
of the generated Query return?  MFQP is known to be trouble, and an 
overhaul to it has been contributed recently.
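
A minimal sketch of how this could happen, assuming (it is not confirmed in this thread) that MFQP expands "A -B" into one "A -B" sub-query per field and ORs them together. It is plain Java that only models the boolean logic; it makes no Lucene calls, and the class, the helper methods and the field names "title" and "body" are hypothetical:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MfqpExpansionSketch {

    // A document is modeled as: field name -> set of terms in that field.
    static boolean fieldMatches(Map<String, Set<String>> doc, String field,
                                String required, String prohibited) {
        Set<String> terms = doc.getOrDefault(field, Collections.<String>emptySet());
        return terms.contains(required) && !terms.contains(prohibited);
    }

    // ASSUMED expansion: (f1:A -f1:B) OR (f2:A -f2:B) OR ...
    // The prohibition is only checked inside the field that holds A.
    static boolean matchesPerFieldExpansion(Map<String, Set<String>> doc,
                                            String required, String prohibited) {
        for (String field : doc.keySet()) {
            if (fieldMatches(doc, field, required, prohibited)) {
                return true;
            }
        }
        return false;
    }

    // Intended meaning: A in at least one field, B in no field at all.
    static boolean matchesIntended(Map<String, Set<String>> doc,
                                   String required, String prohibited) {
        boolean hasRequired = false;
        for (Set<String> terms : doc.values()) {
            if (terms.contains(prohibited)) {
                return false;
            }
            if (terms.contains(required)) {
                hasRequired = true;
            }
        }
        return hasRequired;
    }

    // Hypothetical document: A occurs in one field, B in another.
    static Map<String, Set<String>> sampleDoc() {
        Map<String, Set<String>> doc = new HashMap<>();
        doc.put("title", new HashSet<>(Arrays.asList("A")));
        doc.put("body", new HashSet<>(Arrays.asList("B")));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc = sampleDoc();
        System.out.println("per-field expansion matches: "
                + matchesPerFieldExpansion(doc, "A", "B"));
        System.out.println("intended semantics matches:  "
                + matchesIntended(doc, "A", "B"));
    }
}
```

Under that assumption the document matches "A -B" because A sits in a field that does not contain B, even though a query for B alone also finds it. The toString of the parsed query should show whether this is the shape MFQP actually generates.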

	Erik

On Dec 20, 2004, at 10:32 AM, Christiaan Fluit wrote:

> Hello all,
>
> I have a question regarding the determination of the set of matching 
> documents, in particular (I guess) related to the NOT operator.
>
> In my case I have a document containing the terms A and B. When I 
> query for either A or for B, I get this document back, just as 
> expected. Now when I query for A -B, I once again get this document 
> back. In other words: this document matches both B and a query 
> containing the clause -B, which theoretically should never happen.
>
> I've seen this happen with various keywords, sometimes with multiple 
> "conflicting" documents. In each case, the B query returned the 
> document with a very low relevance (e.g. 0.007...).
>
> Based on these low relevance scores and a quick peek in the Lucene 
> code, I strongly suspect that this is caused by rounding errors, as it 
> seems to me that floating point numbers are used to express both a 
> document's membership in the result set and its score. Can somebody 
> confirm this?
>
> And if this is the case, is there a workaround to eliminate or at 
> least significantly suppress this problem? A colleague mentioned 
> boosting every term in a query; would this solve anything?
>
> For most search engine-like applications, which order documents by 
> relevance, I think this problem is not a real issue since such 
> conflicting documents appear at the end of the result list and are not 
> likely to be seen by the user. However, in our case we have an 
> application which displays overlaps of entire result sets and these 
> documents show up very prominently (I can show screenshots if 
> desired). We have already been asked by customers to explain these 
> results :)
>
> FYI, in case it may be relevant: I'm still using Lucene 1.4.2. Every 
> document has the same set of five fields. The above queries are parsed 
> by MultiFieldQueryParser, using all five fields. I haven't touched the 
> default operator, but the queries A AND -B and A AND NOT B give the 
> same conflicting overlap in the result set.
>
>
> Thanks in advance,
>
> Christiaan Fluit
> Aduna.biz
> --
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org



