lucene-dev mailing list archives

From "Chuck Williams" <ch...@manawiz.com>
Subject RE: Boolean Scorer
Date Sun, 12 Dec 2004 19:23:03 GMT
Daniel,

A perfectly reasonable request -- I'll put together a simple test case
but can't do it today.

The problem is with scoring -- it has nothing to do with AND queries.

The test will run along these lines:
  1.  Use a custom similarity to eliminate all tf and idf effects, just
to isolate what is being tested.
  2.  Create two documents doc1 and doc2, each with two fields title and
description.  doc1 has "elephant" in title and "elephant" in
description.  doc2 has "elephant" in title and "albino" in description.
  3.  Express query for "albino elephant" against both fields.
Problems:
      a.  MultiFieldQueryParser won't recognize either document as
containing both terms, due to the way it expands the query across
fields.
      b.  Expressing query as "title:albino description:albino
title:elephant description:elephant" will score both documents
equivalently, since each matches two query terms.
  4.  Comparison to MaxDisjunctionQuery and my method for expanding
queries across fields.  Using notation that () represents a BooleanQuery
and {} represents a MaxDisjunctionQuery, "albino elephant" expands to:
        ( {title:albino description:albino}
          {title:elephant description:elephant} )
This will recognize that doc2 has both terms matched while doc1 only has
1 term matched, and so will score doc2 over doc1.
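
The comparison in steps 3b and 4 can be sketched with a toy model.  This
is not Lucene code: it assumes every matching field:term clause scores
exactly 1.0 (tf and idf neutralized, as in step 1), ignores coord and
query normalization, and the function names are made up for
illustration.

```python
# Toy model of steps 3b and 4. Assumption: each matching (field, term)
# clause contributes 1.0; coord and query norm are ignored.
DOCS = {
    "doc1": {"title": {"elephant"}, "description": {"elephant"}},
    "doc2": {"title": {"elephant"}, "description": {"albino"}},
}
FIELDS = ("title", "description")
TERMS = ("albino", "elephant")


def clause_score(doc, field, term):
    """Score of a single field:term clause against one document."""
    return 1.0 if term in doc[field] else 0.0


def flat_boolean_score(doc):
    """Step 3b: one flat BooleanQuery, summing all four clauses."""
    return sum(clause_score(doc, f, t) for t in TERMS for f in FIELDS)


def max_expansion_score(doc):
    """Step 4: sum over terms of the max over fields, i.e. ( {..} {..} )."""
    return sum(max(clause_score(doc, f, t) for f in FIELDS) for t in TERMS)


flat = {name: flat_boolean_score(doc) for name, doc in DOCS.items()}
grouped = {name: max_expansion_score(doc) for name, doc in DOCS.items()}

print("flat boolean: ", flat)
print("max expansion:", grouped)
```

Under these assumptions the flat query cannot separate the two
documents, while the max-based expansion counts distinct query terms
covered and so ranks doc2 first.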

Refinement note:  the actual expansion for "albino elephant" that I use is:
        ( {title:albino description:albino}~0.1
          {title:elephant description:elephant}~0.1 )
This causes the score of each MaxDisjunctionQuery to be the score of
the highest-scoring MDQ subclause plus 0.1 times the sum of the scores
of the other MDQ subclauses.  Thus, doc1 gets some credit for also having
"elephant" in the description but only 1/10 as much as doc2 gets for
covering another query term in its description.  If doc3 has "elephant"
in title and both "albino" and "elephant" in the description, then with
the actual refined expansion, it gets the highest score of all (whereas
with pure max, without the 0.1, it would get the same score as doc2).
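
The effect of the 0.1 multiplier on doc1, doc2 and doc3 can be worked
through with the same kind of toy model.  Again this is only a sketch
under the assumption that each matching field:term clause scores 1.0;
the helper names are hypothetical and not part of Lucene.

```python
# Toy model of the ~0.1 refinement. Assumption: each matching
# (field, term) clause contributes 1.0; no tf, idf, coord or norms.
DOCS = {
    "doc1": {"title": {"elephant"}, "description": {"elephant"}},
    "doc2": {"title": {"elephant"}, "description": {"albino"}},
    "doc3": {"title": {"elephant"}, "description": {"albino", "elephant"}},
}
FIELDS = ("title", "description")
TERMS = ("albino", "elephant")


def clause_score(doc, field, term):
    """Score of a single field:term clause against one document."""
    return 1.0 if term in doc[field] else 0.0


def mdq_score(subscores, multiplier):
    """Highest subclause score plus multiplier times the sum of the others."""
    best = max(subscores)
    return best + multiplier * (sum(subscores) - best)


def expansion_score(doc, multiplier):
    """Outer BooleanQuery: sum of one MaxDisjunctionQuery per query term."""
    return sum(
        mdq_score([clause_score(doc, f, t) for f in FIELDS], multiplier)
        for t in TERMS
    )


pure_max = {name: expansion_score(doc, 0.0) for name, doc in DOCS.items()}
refined = {name: expansion_score(doc, 0.1) for name, doc in DOCS.items()}

print("pure max:", pure_max)
print("with 0.1:", refined)
```

With the multiplier at 0.0, doc2 and doc3 tie; at 0.1, the ordering
becomes doc3 over doc2 over doc1, and doc1's extra credit for the second
"elephant" is a tenth of what doc2 gains for covering "albino".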

In real apps, tf's and idf's also come into play of course, but can
affect these either way (i.e., mitigate this fundamental problem or
exacerbate it).

Chuck

  > -----Original Message-----
  > From: Daniel Naber [mailto:daniel.naber@t-online.de]
  > Sent: Sunday, December 12, 2004 2:24 AM
  > To: Lucene Developers List
  > Subject: Re: Boolean Scorer
  > 
  > On Sunday 12 December 2004 04:01, Chuck Williams wrote:
  > 
  > > I maintain the belief that max is *required* to implement reasonable
  > > multi-field searching (1).
  > 
  > Could you give a small example -- preferably a test case -- that shows
  > what the problem is? I know it has been discussed before but I hadn't
  > been able to follow that discussion closely enough. I assume the
  > problem is in the scoring, not in MultiFieldQueryParser.
  > MultiFieldQueryParser has a different problem, namely that it doesn't
  > correctly work with AND queries. Or is that the issue you're talking
  > about? Anyway, that will be fixed soon.
  > 
  > Regards
  >  Daniel
  > 
  > --
  > http://www.danielnaber.de
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

