lucene-dev mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: SpanNearQuery's spans & payloads
Date Sat, 12 Sep 2009 16:22:53 GMT
Paul Elschot wrote:
> On Saturday 12 September 2009 14:40:28 Mark Miller wrote:
> > Michael McCandless wrote:
> > > OK thanks for the responses. This is indeed tricky stuff!
> > >
> > > On Sat, Sep 12, 2009 at 12:28 AM, Mark Miller <markrmiller@gmail.com> wrote:
> > >
> > >
> > >> They start at the left and march right - each Span always starting
> > >> after the last started,
> > >>
> > >
> > > That's not quite always true -- eg I got span 1-8, twice, once I added
> > > "b" as a clause to the SNQ.
> > >
> > Mmm - right - depends on how you look at it I think - it is less simple
> > with terms at multiple positions, in that now each Span doesn't start
> > in the *position* after the last - but if you line up the terms like you
> > did, it's still the same - the first 1 - 8 starts at the first term at
> > pos 1, and the next 1 to 8 starts at the second term at pos 1. One
> > starts after the other (though if you think in Lucene positions, I
> > realize they virtually start at the same spot).
> > >
> > >> You might want exhaustive for highlighting as well - but it's
> > >> different algorithms ...
> > >>
> > >
> > > Yeah, how we would represent spans for highlighting is tricky... we
> > > had discussed this ("how to represent spans for aggregate queries")
> > > recently, I think under LUCENE-1522.
> > >
> > > I think we'd have to return a tree structure, that mirrors the query's
> > > tree structure, to hold the spans, rather than try to enumerate
> > > ("denormalize") all possible expansions. Each leaf node would hold
> > > actual data (position, term, payload, etc.), and then the tree nodes
> > > would express how they are and/ord/near'd together. My app could then
> > > walk the tree to compute any combination I wanted.
> > >
> > >
> > >> In the end, I accepted my definition of works as - when I ask for
> > >> the payloads back, will I end up with a bag of all the payloads that
> > >> the Spans touched. I think you do.
> > >>
> > >
> > > Yeah I think you do, except each payload is only returned once. So
> > > it's only the first span that hits a payload that will return it.
> > >
> > > So it sounds like SNQ just isn't guaranteed to be exhaustive in how it
> > > enumerates the spans, eg I'll never see that 2nd occurrence of "k",
> > > nor its associated payload.
> > >
> > Not only not guaranteed, but it's just not going to happen - it's not
> > how spans match. If I say find n within 300 of m with the following:
> >
> > n m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m
> > m m m m m m m m m m m m
> >
> > Only the first m will match. It will start at the left, find the n, then
> > say great, an m within 300, this doc matches, we are done. There is
> > not another n to start on or finish on to the right. It doesn't then
> > touch the next 300 m's - just the way Doug implemented them from what I
> > can tell. It's only exhaustive from the left - find m within 300 of n,
> > order matters (m first)
> >
> > m m m m m m m m m m m m m m m m m m n
> >
> > This will be a bunch of spans - start at the left - the first m to n
> > matches, then the second m - n matches, then the third m to n matches,
> > and so on as we move right.
>
>
> In the ordered case that last one should only match once, against
> the last m.
>
>
> Regards,
> Paul Elschot

Good point - I was too lazy with my examples - I shouldn't have said
order matters :)

The ordered NearSpan does appear to drop to the min from the left. It
shrinks down to the shortest match, which is part of what makes it so
hard to lazy load the payloads: you don't know that a given start point
is not a match until it has already moved on, and then it might find a
shorter one, in which case you have to dump the payload from the
previous one ... and so on. You can constantly be loading payloads that
don't end up matching (though I think the unordered would consider them
matches, even if they just happened to come in order).
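
A minimal sketch of the shrinking described above, written as a toy model
in Python rather than taken from Lucene's ordered near-spans code. It
assumes two single-term clauses with sorted, distinct positions. The name
`ordered_spans` and the `discarded` list (standing in for payloads that
were loaded for a start point that was later abandoned) are hypothetical.

```python
def ordered_spans(first, second, slop):
    """Toy model of ordered near matching: `first` must precede `second`.

    Both arguments are sorted lists of term positions in one document.
    Returns (spans, discarded), where spans are (start, end) pairs with an
    exclusive end, and discarded lists the start positions that were
    considered and then dropped while shrinking.
    """
    spans, discarded = [], []
    i = j = 0
    while i < len(first) and j < len(second):
        if second[j] <= first[i]:
            j += 1  # the second clause must sit to the right of the first
            continue
        # Shrink from the left: every earlier start is a candidate whose
        # payload would already have been loaded, then thrown away.
        while i + 1 < len(first) and first[i + 1] < second[j]:
            discarded.append(first[i])
            i += 1
        if second[j] - first[i] - 1 <= slop:
            spans.append((first[i], second[j] + 1))
        i += 1
    return spans, discarded
```

For the "m m m m n" layout (m at positions 0 to 3, n at 4) and a slop of
300, this model yields the single span (3, 5) and reports starts 0, 1 and
2 as loaded and then dumped, which is the lazy-loading problem in
miniature.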

Unordered does not attempt to shrink the match like this and works as I
said (I think - Paul's the Spans wizard).
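
A rough sketch of that left-to-right march for the unordered case, again
as a toy model in Python and not Lucene's implementation. It assumes two
single-term clauses with sorted, distinct positions, and the name
`unordered_spans` is hypothetical. Each step reports the window covering
both current positions, then advances whichever clause is furthest left.

```python
def unordered_spans(a, b, slop):
    """Toy model of two-clause unordered near matching.

    `a` and `b` are sorted lists of term positions in one document.
    Returns (start, end) pairs with an exclusive end.
    """
    spans = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = min(a[i], b[j])
        end = max(a[i], b[j]) + 1
        # Two positions are covered by the terms themselves; the rest of
        # the window counts against the slop.
        if end - start - 2 <= slop:
            spans.append((start, end))
        # Advance the clause that currently starts furthest left.
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return spans
```

With n at position 0 and m at positions 1 to 5, the model returns only
(0, 2): once the single n is used up there is nothing left to start from.
With m at 0 to 3 and n at 4 it returns (0, 5), (1, 5), (2, 5) and (3, 5),
one span per m.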

Ordered, I think, works on the same principle but will attempt to shrink
to the smallest Span that still satisfies the query?


-- 
- Mark

http://www.lucidimagination.com



