lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Multiple Query clauses impacting result
Date Wed, 03 Aug 2011 06:34:39 GMT
Hi Saurabh,

 

There is nothing wrong with Lucene. The problem is that you are trying to
read scores as percentages, which they are not. Scores are arbitrary values
used only for sorting search results; they must never be used to compare
results between different queries. In fact, it is easily possible to get
back values > 1.0.

Your examples do the right thing: the sorting is the same in both cases. The
actual score values are *arbitrary*!

 

See http://wiki.apache.org/lucene-java/ScoresAsPercentages for an explanation.
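
As a rough check, the "Match" values quoted in the original message below can be compared as ratios rather than as absolute numbers. This is only a sketch: the class and method names are made up for illustration, and the four numbers are copied from the quoted results. If the ratio between the first and second hit is the same in both runs, the extra clause only rescaled the scores and did not change the ranking.

```java
public class ScoreRatioCheck {
    // Ratio of the top hit's score to the second hit's score.
    static double ratio(double top, double second) {
        return top / second;
    }

    public static void main(String[] args) {
        // Values copied from the two result listings in the quoted message.
        double before = ratio(26.498592, 8.280809); // query without the spanNear clause
        double after = ratio(9.584372, 2.995116);   // query with the spanNear clause

        System.out.printf("ratio before = %.4f, ratio after = %.4f%n", before, after);

        // Equal ratios (up to rounding) mean both scores were multiplied
        // by the same factor, so the relative order is untouched.
        boolean sameRanking = Math.abs(before - after) < 1e-4;
        System.out.println("same relative scoring: " + sameRanking);
    }
}
```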

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Saurabh Gokhale [mailto:saurabhgokhale@gmail.com] 
Sent: Wednesday, August 03, 2011 12:39 AM
To: java-user@lucene.apache.org
Subject: Multiple Query clauses impacting result

 

Hi All,

 

As I add new clauses to the Boolean Query, my queryNorm value goes down
which is impacting the results.

 

 

 

For example (the complete standalone application is attached to this email;
I am using Lucene 3.1.0):

 

I indexed the following 6 documents:

 

addDoc("author1", "My first book", "123"); // 1st column = author name,
                                           // 2nd = subject, 3rd = ISBN #

addDoc("author2", "My next book", "333");

addDoc("author2", "this first text", "444");

addDoc("author3", "test the knowledge", "456");

addDoc("author4", "knowledge is vertue", "789");

addDoc("author5", "saurabh", "222");

 

The Boolean query given below generates the following result:

 

Query = (author:author1) (subject:book subject:first subject:my) -isbn:123

Match: 26.498592%  || Doc Author: author2 || Doc subject: My next book ||
Doc ISBN: 333

Match: 8.280809%  || Doc Author: author2 || Doc subject: this first text ||
Doc ISBN: 444

 

Now, if I add a new clause to this Boolean query, in this case a spanNear
query whose search terms do not exist in any document, my result percentages
go down.

 

Query = (author:author1) (subject:book subject:first subject:my) -isbn:123
spanNear([subject:not, subject:found], 3, true)

Match: 9.584372%  || Doc Author: author2 || Doc subject: My next book || Doc
ISBN: 333

Match: 2.995116%  || Doc Author: author2 || Doc subject: this first text ||
Doc ISBN: 444

 

Now the problem is that the same documents which matched with 26 and 8
percent in the first query result now match with 9 and 2 percent. Ideally I
would not expect any change in the result percentages, as all my clauses are
combined with Boolean OR. But because the queryNorm factor changes when the
new clause is added, my result is impacted. (You can see the complete code
in the attached Java file.)
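
A minimal sketch of the effect described above, under the assumption that the default similarity in Lucene 3.x computes queryNorm as 1 / sqrt(sumOfSquaredWeights). The clause weights and raw document scores here are hypothetical numbers chosen for easy arithmetic, not values taken from the attached application, and other factors such as coord also play a part in the real score.

```java
public class QueryNormSketch {
    // Assumed formula: 1 / sqrt(sum of squared clause weights).
    static double queryNorm(double... clauseWeights) {
        double sumOfSquares = 0.0;
        for (double w : clauseWeights) {
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Hypothetical unnormalized scores of two matching documents.
        double[] rawScores = {4.0, 1.25};

        // Hypothetical clause weights; the second query has one extra clause.
        double normSmall = queryNorm(1.0, 2.0, 2.0);
        double normLarge = queryNorm(1.0, 2.0, 2.0, 4.0);

        for (double raw : rawScores) {
            // Every document is multiplied by the same norm, so the
            // absolute values shrink while their order stays the same.
            System.out.printf("raw=%.2f  small query=%.4f  large query=%.4f%n",
                    raw, raw * normSmall, raw * normLarge);
        }
    }
}
```

Because the norm is a single per-query factor, adding an OR clause that matches nothing lowers every score by the same proportion.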

 

Now consider a scenario where my job is to find whether 100 special words
(either single words or combinations of multiple words) are present in a
document or not. My result percentages will go way down, because not all
documents will contain those words and my queryNorm will be very low due to
the addition of 99 Boolean OR clauses.

 

Is there a way I can get a consistent result regardless of the OR clauses I
add to my query? In other words, is there a way I can control the queryNorm,
if that is the root cause?

 

Thanks

 

Saurabh

