lucene-java-user mailing list archives

From Saurabh Gokhale <saurabhgokh...@gmail.com>
Subject Re: Multiple Query clauses impacting result
Date Wed, 03 Aug 2011 16:02:48 GMT
Hi Uwe,

Thanks for clarifying; the link you gave does have a satisfactory
explanation.

So in a business scenario where we have to make a decision based on an
"accepted" level of matching (say, perform activity A only when a document
matches more than 50%), we won't be able to rely on the match score, because
the score changes with the query: an 80% match under one query may not be as
close as a 5% match under a slightly different query. (I know I am going
back to % again :)

So how do we handle such a scenario?


Thanks

Saurabh


On Wed, Aug 3, 2011 at 1:34 AM, Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi Saurabh,
>
>
>
> There is nothing wrong with Lucene; the problem is that you are trying to
> read scores as percentages, which they aren't. Scores are arbitrary values,
> used only for sorting the results of one query, never for comparing results
> between different queries. It is in fact easily possible to get back
> values >1.0.
>
> Your examples do the right thing: the sorting is the same in both cases.
> The actual score values are *arbitrary*!
>
>
>
> See http://wiki.apache.org/lucene-java/ScoresAsPercentages for an
> explanation.
>
>
>
> Uwe
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> From: Saurabh Gokhale [mailto:saurabhgokhale@gmail.com]
> Sent: Wednesday, August 03, 2011 12:39 AM
> To: java-user@lucene.apache.org
> Subject: Multiple Query clauses impacting result
>
>
>
> Hi All,
>
>
>
> As I add new clauses to the Boolean Query, my queryNorm value goes down
> which is impacting the results.
>
>
>
>
>
>
>
> For example (the complete standalone application is attached to the email;
> I am using Lucene 3.1.0):
>
>
>
> I indexed the following 6 documents:
>
>
>
> addDoc("author1", "My first book", "123");
> // 1st argument = author name, 2nd = subject, 3rd = ISBN
>
> addDoc("author2", "My next book", "333");
>
> addDoc("author2", "this first text", "444");
>
> addDoc("author3", "test the knowledge", "456");
>
> addDoc("author4", "knowledge is vertue", "789");
>
> addDoc("author5", "saurabh", "222");
>
>
>
> The Boolean query given below generates the following result:
>
>
>
> Query = (author:author1) (subject:book subject:first subject:my) -isbn:123
>
> Match: 26.498592%  || Doc Author: author2 || Doc subject: My next book ||
> Doc ISBN: 333
>
> Match: 8.280809%  || Doc Author: author2 || Doc subject: this first text ||
> Doc ISBN: 444
>
>
>
> Now, if I add a new clause to this Boolean query, in this case a SpanNear
> query whose search terms do not exist in the index, my result percentage
> goes down.
>
>
>
> Query = (author:author1) (subject:book subject:first subject:my) -isbn:123
> spanNear([subject:not, subject:found], 3, true)
>
> Match: 9.584372%  || Doc Author: author2 || Doc subject: My next book ||
> Doc ISBN: 333
>
> Match: 2.995116%  || Doc Author: author2 || Doc subject: this first text ||
> Doc ISBN: 444
>
>
>
> Now the problem is that the same documents that matched at 26 and 8
> percent in the first query result now match at 9 and 2 percent. Ideally I
> would not expect any change in the result percentage, as all my clauses
> are optional (Boolean OR). But because the queryNorm factor changes when
> the new clause is added, my result is impacted. (You can see the complete
> code in the attached Java file.)
>
>
>
> Now consider a scenario where my job is to find whether 100 special words
> (either single words or combinations of multiple words) are present in a
> document or not. My scores will go way down, because not all documents
> will contain those words and my queryNorm will be very low due to the
> addition of 99 Boolean OR clauses.
>
>
>
> Is there a way I can get a consistent result regardless of the OR clauses I
> add to my query? I mean, is there a way I can control the queryNorm, if that
> is the root cause?
>
>
>
> Thanks
>
>
>
> Saurabh
>
>

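The numbers quoted in the thread can be checked with a few lines of plain Java. The sketch below does not call Lucene; it assumes that the default similarity in Lucene 3.x computes queryNorm as 1/sqrt(sumOfSquaredWeights) and coord as overlap/maxOverlap, and the class name, helper names, and the squared weights in the second half are hypothetical, chosen only for illustration. It shows that both hits were rescaled by the same factor (roughly 0.36) when the extra clause was added, which is why the ordering did not change, and that queryNorm can only fall as more clauses contribute weight.

```java
import java.util.Locale;

/** Illustrative arithmetic only; this is not Lucene code. */
final class ScoreScaling {

    /** Assumed default-similarity form: 1 / sqrt(sum of squared clause weights). */
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    /** Assumed default-similarity form: matched optional clauses / all optional clauses. */
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    /** Factor by which one hit's score changed between the two queries. */
    static double scale(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        // Scores reported in the thread for the same two documents,
        // before and after the extra spanNear clause was added.
        double[] before = {26.498592, 8.280809};
        double[] after = {9.584372, 2.995116};
        for (int i = 0; i < before.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                    "hit %d scale factor: %.4f", i + 1, scale(before[i], after[i])));
        }

        // Hypothetical squared clause weights: each added clause increases the
        // sum, so queryNorm shrinks even if the clause matches nothing.
        double[] squaredWeights = {1.5, 2.0, 0.5, 3.0};
        double sum = 0.0;
        for (int n = 0; n < squaredWeights.length; n++) {
            sum += squaredWeights[n];
            System.out.println(String.format(Locale.ROOT,
                    "clauses=%d queryNorm=%.4f", n + 1, queryNorm(sum)));
        }

        // A document matching 1 optional clause is also penalised more by coord
        // as the number of optional clauses grows.
        System.out.println(String.format(Locale.ROOT,
                "coord 1 of 2 = %.4f, coord 1 of 3 = %.4f", coord(1, 2), coord(1, 3)));
    }
}
```

Because every hit of a given query is multiplied by the same query-dependent factors, a fixed cutoff such as "more than 50%" means something different for each query, even though the relative order of the hits is stable.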