lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Ordered span query with more than 2 subqueries: avoid?
Date Mon, 19 Apr 2004 21:06:44 GMT
Hi Doug,

On Monday 19 April 2004 20:56, Doug Cutting wrote:
> Paul Elschot wrote:
> > On Tuesday 06 April 2004 18:11, Doug Cutting wrote:
> >>I think this is indeed the problem.  Currently it always increments the
> >>earliest span.  Rather I think it should increment the first span, still
> >>within slop of the earliest span, that is out of order.  So, in your
> >
> > Yes, when the current match length and slop still allow.
> >
> >>example, when the spans are [w1 w3 w2], it should increment w3, since
> >>it's start is zero words after the end of w1 (slop is zero) but it is
> >>out of order: w2 is required after w1.  I think this rule generalizes to
> >>larger queries.
> >>
> >>Does this sound right?  If so, then I'll try to fix it.  I may not get
> >
> > It sounds right, but I'm not certain whether it generalizes to larger
> > queries.
> >
> > The question is: could incrementing the earliest span that is out
> > of order, but within allowed the slop, cause the search window to miss
> > the first ordered occurrence with the allowed slop at or after the
> > beginning of the current search window?
> >
> > I can't answer that question in a few minutes, so I'd rather
> > spend my time on programming the test case for now.
>
> Have you or anyone else had time to think more about this?  Does this
> sound like the appropriate fix?

Well, I submitted a patch on bugzilla, which implements this:
http://issues.apache.org/bugzilla/show_bug.cgi?id=28285

Since only non ordered span cells have next() called on them,
I think there will not be a miss in the ordered case.
I tried some more test cases to confirm this, and these passed.
I didn't send in a patch for these extra test cases, but I'll
attach it as tgz.

There are a few comment lines in firstNonOrderedNextToPartialList()
on performance. For long ordered matches, the patch needs to do many
conversions between partial list and queue. The idea is to build up
the partial ordered list until checkSlop() fails, and to avoid the
conversions:

        // FIXME: continue here, rename to eg. checkOrderedMatch():
        // when checkSlop() and not ordered, repeat cell.next().
        // when checkSlop() and ordered, add to list and repeat queue.pop()
        // without checkSlop(): no match, rebuild the queue from the partial list.
        // When queue is empty and checkSlop() and ordered there is a match.

In conclusion, I now think the patch is correct, but it will take 
a performance hit for longer ordered matches.

The patch calls checkSlop() a bit too much.
I considered subclassing to OrderedNearSpans to avoid
the inOrder check and to streamline the checkSlop()
calls, but I would like your opinion before doing
a larger refactoring like that.
The refactoring might also introduce a NonOrderedNearSpans,
and I don't feel really confident about doing a split into three of
this non trivial state machine.

Also I'd like to see some more created test cases
pass before evt. continuing on the code.
Anyone else willing to try and fry this?

Regards,
Paul

Mime
View raw message