lucene-dev mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Ordered span query with more than 2 subqueries: avoid?
Date Tue, 06 Apr 2004 19:23:12 GMT
Doug,

On Tuesday 06 April 2004 18:11, Doug Cutting wrote:
> Paul Elschot wrote:
> > A test of the ordered span query with three terms:
> >    w1  w2  w3
> > and slop 1 against document:
> >    w1 w3 w2 w3
> > fails.
>
> Thanks for catching this.  It would be helpful if you could submit a
> JUnit test which tests this case.

I'll try.
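
A minimal sketch of the expectation such a test would encode, written without Lucene so it stands alone. It is a brute-force reference check, not the eventual JUnit test and not how SpanNearQuery scans: the class and method names are hypothetical, each term is assumed to occupy exactly one position, the term list is assumed non-empty, and slop is taken as the javadoc describes it (the number of unmatched positions inside the match).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical reference checker; not part of Lucene.
final class OrderedNearReference {

    /**
     * Returns true if the terms occur in the given order in doc with at most
     * slop unmatched positions between the first and the last matched term.
     */
    static boolean matches(List<String> doc, List<String> terms, int slop) {
        for (int start = 0; start < doc.size(); start++) {
            if (doc.get(start).equals(terms.get(0))
                    && extend(doc, terms, 1, start, start, slop)) {
                return true;
            }
        }
        return false;
    }

    // Tries every position after prev for terms[next]. first is the position
    // of terms[0]; prev is the position chosen for terms[next - 1].
    private static boolean extend(List<String> doc, List<String> terms,
                                  int next, int first, int prev, int slop) {
        if (next == terms.size()) {
            // Match length minus the number of matched terms.
            int unmatched = (prev - first + 1) - terms.size();
            return unmatched <= slop;
        }
        for (int pos = prev + 1; pos < doc.size(); pos++) {
            if (doc.get(pos).equals(terms.get(next))
                    && extend(doc, terms, next + 1, first, pos, slop)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("w1", "w3", "w2", "w3");
        List<String> query = Arrays.asList("w1", "w2", "w3");
        System.out.println(matches(doc, query, 1));
        System.out.println(matches(doc, query, 0));
    }
}
```

Under these assumptions the document w1 w3 w2 w3 matches with slop 1 via w1 . w2 w3, and does not match with slop 0.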

> > The javadoc (1.4 rc3) of SpanNearQuery gives:
> >   Matches spans which are near one another. One can specify slop, the
> > maximum number of intervening unmatched positions, as well as whether
> > matches are required to be in-order.
> >
> > But the span search seems to scan the document from
> >
> > w1 w3 w2
> >
> > to
> >
> > w3 w2 w3
> >
> > instead of allowing for the slop to match w1 . w2 w3.
>
> I think this is indeed the problem.  Currently it always increments the
> earliest span.  Rather I think it should increment the first span, still
> within slop of the earliest span, that is out of order.  So, in your

Yes, when the current match length and slop still allow.

> example, when the spans are [w1 w3 w2], it should increment w3, since
> its start is zero words after the end of w1 (slop is zero) but it is
> out of order: w2 is required after w1.  I think this rule generalizes to
> larger queries.
>
> Does this sound right?  If so, then I'll try to fix it.  I may not get

It sounds right, but I'm not certain whether it generalizes to larger
queries.

The question is: could incrementing the earliest span that is out
of order, but within the allowed slop, cause the search window to miss
the first ordered occurrence with the allowed slop at or after the
beginning of the current search window?

I can't answer that question in a few minutes, so I'd rather
spend my time on programming the test case for now.

(What was that joke again about a fool, wise men, and questions?)

> to it for a few weeks however, since I'm busy this week and on vacation
> next week.
>
> > Anyway, does this mean that I should not use an ordered SpanNearQuery
> > with some slop with more than 2 subqueries?
>
> Until we fix this, yes.  Thanks for identifying this bug.

It's easy to work around: one only needs to nest some ordered span
queries with 2 subqueries each. This does not give exactly the same
behaviour, but it's good enough in practice.
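
A minimal sketch of the nesting idea, again without Lucene so it is self-contained. In Lucene terms it corresponds to using an inner ordered SpanNearQuery over w1 and w2 as the first clause of an outer ordered SpanNearQuery whose second clause is w3. The class and method names below are hypothetical, spans are modelled as [start, end) position pairs, and the pairwise matching is a simplified model of the intended semantics, not Lucene's actual iteration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of nested two-clause ordered near queries; not Lucene code.
final class NestedNearSketch {

    /** All single-position spans [pos, pos + 1) where term occurs in doc. */
    static List<int[]> termSpans(List<String> doc, String term) {
        List<int[]> spans = new ArrayList<>();
        for (int pos = 0; pos < doc.size(); pos++) {
            if (doc.get(pos).equals(term)) {
                spans.add(new int[] {pos, pos + 1});
            }
        }
        return spans;
    }

    /**
     * Ordered near of two span lists: keeps every pair in which the right span
     * starts at or after the end of the left span with at most slop positions
     * in between. Each result span covers both inputs.
     */
    static List<int[]> orderedNear(List<int[]> left, List<int[]> right, int slop) {
        List<int[]> result = new ArrayList<>();
        for (int[] l : left) {
            for (int[] r : right) {
                int gap = r[0] - l[1];
                if (gap >= 0 && gap <= slop) {
                    result.add(new int[] {l[0], r[1]});
                }
            }
        }
        return result;
    }

    /** The workaround: (w1 near w2) near w3, with slop applied at each level. */
    static boolean nestedMatches(List<String> doc, String w1, String w2,
                                 String w3, int slop) {
        List<int[]> inner = orderedNear(termSpans(doc, w1), termSpans(doc, w2), slop);
        return !orderedNear(inner, termSpans(doc, w3), slop).isEmpty();
    }

    public static void main(String[] args) {
        List<String> failing = Arrays.asList("w1", "w3", "w2", "w3");
        List<String> wider = Arrays.asList("w1", "x", "w2", "x", "w3");
        System.out.println(nestedMatches(failing, "w1", "w2", "w3", 1));
        System.out.println(nestedMatches(wider, "w1", "w2", "w3", 1));
    }
}
```

In this model the nested form matches w1 w3 w2 w3 with slop 1, as wanted. It also matches w1 x w2 x w3 with slop 1, because the slop is allowed once per nesting level; a single three-term query with slop 1 would, by the javadoc definition, permit only one unmatched position in total. That is the sense in which the behaviour is not exactly the same.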

> > I'm testing a parser for the span queries,  so posting self contained
> > test code would require some coding around that parser.
>
> Will you be able to contribute the parser?  It would be good to have a
> SpanQuery parser in Lucene, if it is general-purpose.

I would like to contribute the parser, and I hope I will be allowed to
do so. It is quite general, but not general purpose: the target
audience is power users. It does not use an analyzer and
there is no default operator.
When/if the time comes I'll ask here how to contribute.

> ...

> Thanks again,

My pleasure, have a good vacation.

Paul.

P.S. Only slightly off topic. Are you familiar with:
http://citeseer.ist.psu.edu/457664.html
Fast Algorithms for k-word Proximity Search (2001) 
Kunihiko Sadakane, Hiroshi Imai
It's about finding minimal intervals of k terms with arbitrary order.

