lucene-java-user mailing list archives

From Saurabh Gokhale <saurabhgokh...@gmail.com>
Subject Multiple Query clauses impacting result
Date Tue, 02 Aug 2011 22:39:07 GMT
Hi All,

As I add new clauses to a Boolean query, the queryNorm value goes down,
which lowers the scores of my results.



For example (the complete standalone application is attached to this email;
I am using Lucene 3.1.0):

I indexed the following 6 documents (1st column = author name, 2nd column =
subject, 3rd column = ISBN):

addDoc("author1", "My first book", "123");
addDoc("author2", "My next book", "333");
addDoc("author2", "this first text", "444");
addDoc("author3", "test the knowledge", "456");
addDoc("author4", "knowledge is vertue", "789");
addDoc("author5", "saurabh", "222");

The Boolean query below produces the following result:

Query = (author:author1) (subject:book subject:first subject:my) -isbn:123
Match: 26.498592%  || Doc Author: author2 || Doc subject: My next book || Doc ISBN: 333
Match: 8.280809%  || Doc Author: author2 || Doc subject: this first text || Doc ISBN: 444

Now, if I add one more clause to this Boolean query, in this case a
SpanNearQuery whose search terms do not exist in the index, the match
percentages go down.

Query = (author:author1) (subject:book subject:first subject:my) -isbn:123 spanNear([subject:not, subject:found], 3, true)
Match: 9.584372%  || Doc Author: author2 || Doc subject: My next book || Doc ISBN: 333
Match: 2.995116%  || Doc Author: author2 || Doc subject: this first text || Doc ISBN: 444

The problem is that the same documents that matched at about 26 and 8
percent in the first query now match at about 9 and 3 percent. I did not
expect the percentages to change at all, because all of my clauses are
combined with Boolean OR (SHOULD). But because the queryNorm factor changes
when the new clause is added, my results are affected. (You can see the
complete code in the attached Java file.)
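
As a rough check of the numbers above, here is a minimal sketch in plain Java (no Lucene dependency). It assumes the formulas used by Lucene 3.x DefaultSimilarity (idf = 1 + ln(numDocs / (docFreq + 1)), queryNorm = 1 / sqrt(sum of squared weights), coord = overlap / maxOverlap), no boosts, and a span clause weight equal to the sum of its terms' idf values. The class and method names are hypothetical and are not taken from the attached file.

```java
class QueryNormSketch {
    // idf, assuming the DefaultSimilarity formula: 1 + ln(numDocs / (docFreq + 1))
    public static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // queryNorm, assuming DefaultSimilarity: 1 / sqrt(sum of squared clause weights)
    public static double queryNorm(double... clauseWeights) {
        double sum = 0.0;
        for (double w : clauseWeights) {
            sum += w * w;
        }
        return 1.0 / Math.sqrt(sum);
    }

    // coord, assuming DefaultSimilarity: matched optional clauses / total optional clauses
    public static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // Ratio of a hit's score after adding the span clause to its score before.
    public static double scoreRatio() {
        int numDocs = 6;
        // Document frequencies read off the six indexed documents
        double author1 = idf(1, numDocs);
        double book = idf(2, numDocs);
        double first = idf(2, numDocs);
        double my = idf(2, numDocs);
        // Assumption: neither span term occurs in the index (docFreq 0),
        // and the span clause weight is the sum of its terms' idf values.
        double span = idf(0, numDocs) + idf(0, numDocs);

        // The prohibited clause (-isbn:123) is assumed not to contribute.
        double normBefore = queryNorm(author1, book, first, my);
        double normAfter = queryNorm(author1, book, first, my, span);

        // Both hits match only the subject group:
        // 1 of 2 top-level optional clauses before, 1 of 3 after.
        double coordBefore = coord(1, 2);
        double coordAfter = coord(1, 3);

        return (normAfter * coordAfter) / (normBefore * coordBefore);
    }

    public static void main(String[] args) {
        System.out.println("score ratio after/before = " + scoreRatio());
    }
}
```

Under these assumptions the ratio comes out at roughly 0.36, which is close to the observed 9.584372 / 26.498592. If that holds, the drop comes from two factors together: a smaller queryNorm and a smaller coord.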

Now consider a scenario where my job is to find out whether 100 special
terms (single words or combinations of multiple words) are present in a
document or not. My scores will drop much further, because no document will
contain all of those terms, and the queryNorm will be very low because of the
99 additional OR clauses.
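
To give a feel for how this scales, the sketch below uses a deliberately simplified model: every optional clause has the same weight, and a hit's score is taken to be proportional to queryNorm times coord, with both computed as in DefaultSimilarity. The class name, the method name, and the weight of 2.0 are hypothetical values chosen for illustration.

```java
class ClauseCountSketch {
    // Relative score factor for a document matching `matched` of `n` optional
    // clauses, when every clause has the same weight `idf` (a simplification).
    public static double factor(int matched, int n, double idf) {
        double queryNorm = 1.0 / Math.sqrt(n * idf * idf); // 1 / sqrt(sum of squares)
        double coord = (double) matched / n;               // overlap / maxOverlap
        return queryNorm * coord;
    }

    public static void main(String[] args) {
        double baseline = factor(1, 1, 2.0);
        // A document that matches exactly one clause, as the clause count grows
        for (int n : new int[] {1, 10, 100}) {
            System.out.printf("n=%d relative=%.4f%n", n, factor(1, n, 2.0) / baseline);
        }
    }
}
```

In this model, a document that matches a single clause out of 100 scores about one thousandth of what it scores against the one-clause query: a factor of 1/10 from queryNorm and 1/100 from coord.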

Is there a way to get consistent scores regardless of how many OR clauses I
add to the query? In other words, is there a way to control the queryNorm,
if that is indeed the root cause?

Thanks

Saurabh
