lucene-dev mailing list archives

From Eran Sevi <erans...@gmail.com>
Subject Re: SpanQuery and BoostingTermQuery oddities
Date Sun, 16 Aug 2009 09:29:44 GMT
I've managed to put together a partial solution to this problem.

The result is that a SpanOrQuery gets the same score as a regular
BooleanQuery with only SHOULD clauses, and a SpanNearQuery gets the same
score as a regular BooleanQuery with only MUST clauses.

The good part is that the score is calculated recursively and the boosts of
the inner queries are taken into account.
The bad part is that the span distance is not taken into account, and that
the spans are fetched separately for each sub query, which can really hurt
performance.

My solution is as follows:

1. For each "complex" Span*Query, create a weight class that inherits from
SpanWeight (e.g. SpanNearWeight for SpanNearQuery).
2. The new weight class is initialized with the SpanNearQuery and creates a
weight for each of the query's clauses - this gives us the recursive pass.
3. Override the "SumOfSquaredWeights" and "Normalize" methods in the same
way as the BooleanWeight implementation.
4. Override the "Scorer" method as follows: create a BooleanScorer and add
the scorers from the weights of the sub queries. For SpanOrQuery add them as
not required and not prohibited. For SpanNearQuery add them as required and
not prohibited.
5. Override the "CreateWeight" method in the Span*Query to return the new
weight class instead of the old SpanWeight class (the SpanWeight class will
still be returned for SpanTermQuery, which doesn't contain any sub queries
and shouldn't be overridden).
6. Optional - change the "SetFreqCurrentDoc" method in SpanScorer to sum the
freq in each doc instead of running SloppyFreq.
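
A minimal sketch of steps 2 to 4 in plain Java follows. It is a toy model,
not the actual Lucene or Lucene.Net code: ToyWeight, TermWeight,
CompositeWeight, prepare and doc are hypothetical names, a document is
reduced to a set of terms, and tf, field norms, coord and span positions are
all left out. It only illustrates how the sum of squared weights and the
normalization recurse through the clauses, and how required versus optional
clauses change which documents get a score.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the recursive weighting idea; none of these types are Lucene classes. */
public class RecursiveSpanWeightSketch {

    /** Hypothetical stand-in for a weight with the two-pass normalization protocol. */
    interface ToyWeight {
        float sumOfSquaredWeights();       // pass 1: collect raw query weights
        void normalize(float norm);        // pass 2: push the query norm down
        float score(Set<String> docTerms); // 0 means "no match" in this toy model
    }

    /** Leaf clause, playing the role of the weight of a SpanTermQuery. */
    static final class TermWeight implements ToyWeight {
        private final String term;
        private final float idf;
        private final float boost;
        private float value;

        TermWeight(String term, float idf, float boost) {
            this.term = term;
            this.idf = idf;
            this.boost = boost;
        }
        public float sumOfSquaredWeights() {
            float queryWeight = idf * boost;
            return queryWeight * queryWeight;
        }
        public void normalize(float norm) {
            value = idf * boost * norm * idf;
        }
        public float score(Set<String> docTerms) {
            return docTerms.contains(term) ? value : 0f;
        }
    }

    /** Composite clause: allRequired = true is SpanNear-like, false is SpanOr-like. */
    static final class CompositeWeight implements ToyWeight {
        private final boolean allRequired;
        private final float boost;
        private final List<ToyWeight> clauses;

        CompositeWeight(boolean allRequired, float boost, ToyWeight... clauses) {
            this.allRequired = allRequired;
            this.boost = boost;
            this.clauses = Arrays.asList(clauses);
        }
        // Step 3, modelled on BooleanWeight: sum the clauses, then apply our own boost.
        public float sumOfSquaredWeights() {
            float sum = 0f;
            for (ToyWeight w : clauses) {
                sum += w.sumOfSquaredWeights();
            }
            return sum * boost * boost;
        }
        public void normalize(float norm) {
            float boosted = norm * boost;
            for (ToyWeight w : clauses) {
                w.normalize(boosted);
            }
        }
        // Step 4: required clauses must all match, optional ones just add up.
        public float score(Set<String> docTerms) {
            float sum = 0f;
            for (ToyWeight w : clauses) {
                float s = w.score(docTerms);
                if (s == 0f && allRequired) {
                    return 0f;
                }
                sum += s;
            }
            return sum;
        }
    }

    /** Runs both passes once, using norm = 1 / sqrt(sum of squared weights). */
    static ToyWeight prepare(ToyWeight w) {
        float norm = (float) (1.0 / Math.sqrt(w.sumOfSquaredWeights()));
        w.normalize(norm);
        return w;
    }

    static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    public static void main(String[] args) {
        ToyWeight or = prepare(new CompositeWeight(false, 1f,
                new TermWeight("a", 1f, 1f), new TermWeight("b", 1f, 3f)));
        ToyWeight near = prepare(new CompositeWeight(true, 1f,
                new TermWeight("a", 1f, 1f), new TermWeight("b", 1f, 3f)));
        System.out.println("or,   doc {a}:    " + or.score(doc("a")));
        System.out.println("or,   doc {b}:    " + or.score(doc("b")));
        System.out.println("near, doc {a}:    " + near.score(doc("a")));
        System.out.println("near, doc {a, b}: " + near.score(doc("a", "b")));
    }
}
```

In this toy model the clause boosted by 3 contributes three times as much as
the unboosted one, and the SpanNear-like weight gives no score to a document
that contains only one of the two terms.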

I hope you can understand the main idea from my complicated description.
The problem with the current spans implementation is that by the time you
have the spans you don't know how they were created - the spans of a
complicated query and of a simple query look the same and are treated the
same.

With this method you at least get a score for span queries that, while not
the most accurate, takes sub queries and boosts into account.
I haven't dealt with SpanNotQuery yet but I guess it can follow the same
base idea - create sub scorers with MUST and MUST_NOT for the
inclusive/exclusive sub queries of SpanNotQuery.
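
The MUST / MUST_NOT idea can be sketched in plain Java as below. This is a
hypothetical toy, not Lucene code: Occur, Clause and combine are made-up
names, and a clause is reduced to a precomputed score where 0 means "did not
match". One caveat worth keeping in mind: a prohibited clause here rejects
the whole document, whereas SpanNotQuery only drops include spans that
overlap an exclude span, so this is a coarser approximation.

```java
import java.util.Arrays;
import java.util.List;

/** Toy combination of include/exclude clauses; not part of Lucene. */
public class SpanNotScoreSketch {

    enum Occur { MUST, MUST_NOT }

    /** A clause with its role and its per-document score (0 = no match). */
    static final class Clause {
        final Occur occur;
        final float score;

        Clause(Occur occur, float score) {
            this.occur = occur;
            this.score = score;
        }
    }

    /** Sums the MUST scores; returns 0 if a MUST misses or a MUST_NOT matches. */
    static float combine(List<Clause> clauses) {
        float sum = 0f;
        for (Clause c : clauses) {
            boolean matched = c.score > 0f;
            if (c.occur == Occur.MUST_NOT) {
                if (matched) {
                    return 0f; // excluded sub query matched: reject the document
                }
            } else {
                if (!matched) {
                    return 0f; // included sub query did not match
                }
                sum += c.score;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Include clause matches, exclude clause does not: the include score survives.
        System.out.println(combine(Arrays.asList(
                new Clause(Occur.MUST, 0.8f), new Clause(Occur.MUST_NOT, 0f))));
        // Exclude clause matches as well: the document is rejected.
        System.out.println(combine(Arrays.asList(
                new Clause(Occur.MUST, 0.8f), new Clause(Occur.MUST_NOT, 0.5f))));
    }
}
```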

* I can attach my code for the above changes, but I use Lucene.Net, so the
code will be in C# and the Lucene version is 2.3.2.

Eran.

On Wed, Aug 12, 2009 at 1:25 PM, Michael McCandless
<lucene@mikemccandless.com> wrote:

> All Span*Query seem to rely on the SpanQuery.createWeight (which
> returns SpanWeight/SpanScorer) to make their weight/scorer.
> SpanScorer in turn simply enumerates all spans summing up their
> "sloppy freq" and always scoring with that, regardless of the sub
> queries.
>
> So SpanNearQuery (or any composite span query, eg even SpanFirstQuery
> I think will do this), disregards the scores of its child query/ies.
>
> I agree it's odd... it seems like composite span queries ought to take
> their child query scoring into account.  This would be a benefit of
> merging into the normal Query*, since these composite queries already
> factor in scoring from their sub queries.
>
> Mike
>
> On Wed, Aug 5, 2009 at 11:01 AM, Mark Miller<markrmiller@gmail.com> wrote:
> > Grant Ingersoll wrote:
> >>
> >> On Aug 5, 2009, at 10:07 AM, Mark Miller wrote:
> >>>
> >>> Yeah - SpanQuery's don't use the boosts from subspans - it just uses
> >>> the idf for the query terms and the span length I believe - and the
> >>> boost for the top level Query.
> >>>
> >>> Is that the right way to go? I guess Doug seemed to think so? I don't
> >>> know. It is sort of a bug that lower boosts would be ignored right?
> >>> There is an issue for it somewhere.
> >>>
> >>> It gets complicated quick to change it - all of a sudden you need
> >>> something like BooleanQuery ...
> >>>
> >>
> >> Not sure it needs BooleanQuery, but it does seem like it should take
> >> into account the scores of the subclauses (regardless of
> >> BoostingTermQuery). There is a spot in creating the SpanScorer where it
> >> gets the value from the QueryWeight, but this QueryWeight does not
> >> account for the subclauses QueryWeights.
> >>
> >
> > > It doesn't need BooleanQuery - it needs BooleanQuery type logic - which
> > > is fairly complicated. At least to do it right I think. I don't have a
> > > clear memory of it, but I started to try and address this once and ...
> > > well I didn't continue.
> >
> > --
> > - Mark
> >
> > http://www.lucidimagination.com
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
>
