As I add new clauses to a BooleanQuery, my queryNorm value goes down, which is affecting the result scores.
For example (the complete standalone application is attached to this email; I am using Lucene 3.1.0):
I indexed the following 6 documents:
addDoc("author1", "My first book", "123"); // 1st column = author name, 2nd = subject, 3rd = ISBN
addDoc("author2", "My next book", "333");
addDoc("author2", "this first text", "444");
addDoc("author3", "test the knowledge", "456");
addDoc("author4", "knowledge is vertue", "789");
addDoc("author5", "saurabh", "222");
The BooleanQuery below produces the following results:
Query = (author:author1) (subject:book subject:first subject:my) -isbn:123
Match: 26.498592% || Doc Author: author2 || Doc subject: My next book || Doc ISBN: 333
Match: 8.280809% || Doc Author: author2 || Doc subject: this first text || Doc ISBN: 444
Now, if I add one more clause to this BooleanQuery, in this case a SpanNearQuery whose terms do not exist in the index, my result percentages go down.
Query = (author:author1) (subject:book subject:first subject:my) -isbn:123 spanNear([subject:not, subject:found], 3, true)
Match: 9.584372% || Doc Author: author2 || Doc subject: My next book || Doc ISBN: 333
Match: 2.995116% || Doc Author: author2 || Doc subject: this first text || Doc ISBN: 444
The problem is that the same documents that matched at about 26% and 8% in the first query now match at about 9% and 3%. Ideally I would not expect any change in the percentages, since all my clauses are combined with Boolean OR. But because the queryNorm factor changes when the new clause is added, my results are affected. (You can see the complete code in the attached Java file.)
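To put numbers on this, here is a small standalone check using only the four percentages quoted above (no Lucene needed; the class and method names are just for illustration). It suggests that both hits shrink by the same factor, so the ratio between them stays at roughly 3.2 and only the absolute values change:

```java
public class ScoreRatioCheck {
    // Ratio between the first and second hit of one result list.
    static double ratio(double top, double second) {
        return top / second;
    }

    public static void main(String[] args) {
        // Percentages copied from the two result lists in this email.
        double[] before = {26.498592, 8.280809}; // without the spanNear clause
        double[] after  = {9.584372, 2.995116};  // with the spanNear clause

        System.out.printf("hit1/hit2 before: %.4f%n", ratio(before[0], before[1]));
        System.out.printf("hit1/hit2 after:  %.4f%n", ratio(after[0], after[1]));

        // How much each hit shrank when the extra clause was added.
        System.out.printf("shrink factor hit1: %.4f%n", before[0] / after[0]);
        System.out.printf("shrink factor hit2: %.4f%n", before[1] / after[1]);
    }
}
```

So the ordering of the hits is not affected in this example; what I lose is the ability to compare the absolute percentage across differently shaped queries.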
Now consider a scenario where my job is to find whether 100 special words (either single words or combinations of multiple words) are present in a document or not. My scores will drop sharply, because not every document contains those words and my queryNorm will be very low due to the 99 additional OR clauses.
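As far as I understand it, DefaultSimilarity computes queryNorm as 1 / sqrt(sumOfSquaredWeights), where every non-prohibited clause contributes its squared weight to the sum. The sketch below is not Lucene code; it only reproduces that arithmetic, and the per-clause weight of 2.0 is a made-up value, to show how quickly the factor falls as clauses are added:

```java
import java.util.Arrays;

public class QueryNormSketch {
    // Assumed formula: queryNorm = 1 / sqrt(sum of squared clause weights).
    static float queryNorm(float[] clauseWeights) {
        float sumOfSquares = 0f;
        for (float w : clauseWeights) {
            sumOfSquares += w * w;
        }
        return (float) (1.0 / Math.sqrt(sumOfSquares));
    }

    // Helper: n clauses that all carry the same (hypothetical) weight.
    static float[] uniformWeights(int clauseCount, float weight) {
        float[] weights = new float[clauseCount];
        Arrays.fill(weights, weight);
        return weights;
    }

    public static void main(String[] args) {
        float weight = 2.0f; // hypothetical weight, not taken from my index
        for (int n : new int[] {1, 4, 100}) {
            System.out.println(n + " clause(s): queryNorm = "
                    + queryNorm(uniformWeights(n, weight)));
        }
    }
}
```

With equal weights the factor falls in proportion to 1 / sqrt(number of clauses), so going from 1 clause to 100 clauses makes it ten times smaller even if none of the extra clauses match anything.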
Is there a way to get consistent scores regardless of how many OR clauses I add to my query? In other words, is there a way to control queryNorm, if that is indeed the root cause?