lucene-dev mailing list archives

From Mark Miller <>
Subject Re: SpanNearQuery's spans & payloads
Date Sat, 12 Sep 2009 04:28:55 GMT
Michael McCandless wrote:
> Thanks Mark! -- comments below:
> On Fri, Sep 11, 2009 at 3:34 PM, Mark Miller <> wrote:
>> I'd have to dig in to be of much help. Hard to remember this stuff.
>> 0:a 1:a 1:b 2:c 2:d 3:e 3:a 4:f 4:g 5:h 5:i 6:j 6:a 7:b 7:k 8:k
>>  span 0 to 8
>>  span 1 to 8
>>  span 3 to 8
>>  span 6 to 8
>> I think those are the right 4. You start on the left and work
>> right. Spans always start after the last one started.
> OK, so SpanNearQuery always takes its left-most clause, releases a
> span, and then advances it?  What if there is a tie for two left-most
> clauses?
> Eg if I had included "b" as a clause, here, it'd tie with "a" at
> position 1 -- hmm, I just tested this: you get "span 1 to 8" twice:
>     span 0 to 8
>        payload: pos: 7
>        payload: pos: 1
>        payload: pos: 0
>     span 1 to 8
>        payload: pos: 0
>     span 1 to 8
>        payload: pos: 3
>     span 3 to 8
>        payload: pos: 6
>     span 6 to 8
>        payload: pos: 6
> Also, the payloads sort of shifted down (eg "pos: 3" now shows up in
> the "span 1 to 8" but before showed up in "span 3 to 8"), and "pos: 1"
> (for b) was added under "span 0 to 8".
> (NOTE: confusingly, the "payload: pos: N" is off by one, in this test,
> ie the "real" position is N+1).
>> So first you would find: 0 to 8. After 0, 1 to 8.
>> After 1, 3 to 8, and after 3, 6 to 8. That makes sense.
>> You never see 9 because the 8 comes first and you can
>> end as many times on a pos as you want - but you don't
>> ever start a span at the same pos. So I think this is right.
> I think (if I were using SpanNearQuery) I'd want it to somehow include
> 9, but I'm not quite sure how.  This test sets slop to 30, so maybe
> I'd want to see 0-9, 1-9, 3-9, 6-9?  Ie the "maximal" spans possible.
> EG my app will never see "k"'s payload from its occurrence at position
> 8.
You might want it, but that's not how Spans currently work - they are
not exhaustive. They start at the left and march right - each Span
always starting after the last one started, but ending at the closest
match. It's just how the query works, and so when payloads were
grafted on ... they are made to match documents quickly - not to
enumerate all matches in a document (I guess).

You might want exhaustive for highlighting as well - but that takes
different algorithms ...
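
To make the difference concrete, here is a rough toy model of the two
enumeration styles. It is only a sketch and not the NearSpansOrdered
code: it uses plain int arrays instead of Lucene classes, and it assumes
the test query is an ordered near of "a" (positions 0, 1, 3, 6) followed
by "k" (positions 7, 8), with the span end reported as one past the last
matched position. The class and method names are made up for
illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: sorted int arrays stand in for term positions.
// No Lucene classes are used; all names here are hypothetical.
class SpanEnumerationSketch {

    // Left-to-right style: each start position is paired only with the
    // closest later position of the second clause.
    static List<int[]> leftToRight(int[] starts, int[] ends, int slop) {
        List<int[]> spans = new ArrayList<>();
        for (int s : starts) {
            for (int e : ends) {
                if (e > s && e - s - 1 <= slop) {
                    spans.add(new int[] {s, e + 1}); // end is exclusive
                    break;                           // stop at the closest match
                }
            }
        }
        return spans;
    }

    // Exhaustive style: every (start, end) pair within slop is reported.
    static List<int[]> exhaustive(int[] starts, int[] ends, int slop) {
        List<int[]> spans = new ArrayList<>();
        for (int s : starts) {
            for (int e : ends) {
                if (e > s && e - s - 1 <= slop) {
                    spans.add(new int[] {s, e + 1});
                }
            }
        }
        return spans;
    }

    // Renders spans as "start-end" pairs separated by spaces.
    static String format(List<int[]> spans) {
        List<String> parts = new ArrayList<>();
        for (int[] span : spans) {
            parts.add(span[0] + "-" + span[1]);
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        int[] a = {0, 1, 3, 6};
        int[] k = {7, 8};
        System.out.println(format(leftToRight(a, k, 30))); // 0-8 1-8 3-8 6-8
        System.out.println(format(exhaustive(a, k, 30)));  // 0-8 0-9 1-8 1-9 3-8 3-9 6-8 6-9
    }
}
```

Under those assumptions the first method gives the four spans from the
test, and the "k" at position 8 never ends a span. The second one reports
it, at the cost of walking every pair.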
>> The second question I am less sure about without looking at code.
>> I think it's because each payload can only be loaded once. So the first
>> time you hit 0 to 8, you get both payloads - but every other span that
>> hits 8, that payload was already loaded? So you get all of the payloads
>> you should, you're just not getting duplicates in each span. I'd have to
>> think harder about it - but overall it appears right ... ?
> Yeah that is the reason why you only see each payload once, but I'm
> not sure that's "right".  I guess an app can always store away each
> payload and pull it later, but eg if the app wants to score each span
> using the payloads from all occurrences of clauses within it, you
> can't trust getPayloads for that.
Fair enough - my idea of what appears right is tainted - I finished getting
NearSpansOrdered to work with payloads and I've fixed some bugs -
but I've never considered how it *should* work - I've just cursed and
moved on trying to get what we have to work.

In the end, I accepted my definition of "works" as - when I ask for the
payloads back, will I end up with a bag of all the payloads that the
Spans touched? I think you do. If each sub Span duplicated payloads, that
might be right for some apps and a pain for others, right? You can't
count on the order of the payloads or anything, I think (been a while) -
so it's just like getting back a bag of those that matched.
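
Mike's workaround above (store away each payload and pull it later) could
look roughly like the sketch below. This is an assumption-laden toy, not
Lucene API: PayloadCacheSketch, record and payloadsWithin are made-up
names, payloads are plain strings keyed by position, and the delivery
pattern in main just follows the "each payload is only loaded once"
guess from earlier in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical app-side cache: remembers every payload the spans hand
// back, keyed by position, so later spans can look them up again.
class PayloadCacheSketch {

    private final TreeMap<Integer, String> byPosition = new TreeMap<>();

    // Call this for every payload handed back while iterating the spans.
    void record(int position, String payload) {
        byPosition.put(position, payload);
    }

    // All cached payloads with start <= position < end, in position order.
    List<String> payloadsWithin(int start, int end) {
        return new ArrayList<>(byPosition.subMap(start, end).values());
    }

    public static void main(String[] args) {
        PayloadCacheSketch cache = new PayloadCacheSketch();
        // Assumed delivery: span 0-8 hands back the payloads at 0 and 7.
        cache.record(0, "a@0");
        cache.record(7, "k@7");
        // Span 1-8 only hands back position 1, since 7 was already loaded.
        cache.record(1, "a@1");
        // The app can still see both payloads that belong to span 1-8.
        System.out.println(cache.payloadsWithin(1, 8)); // [a@1, k@7]
    }
}
```

That only gets back payloads that some span actually loaded, so a
position no span ever ends on (like the second "k") still stays
invisible.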

Anyway - I'm not happy with a few things, but it was fairly hard just
getting things to work at this level. I'd love for NearSpansOrdered to
actually lazy load the payloads, for one.
>> All the Spans are subspans of a larger Span right?
Sorry ;) I'm practicing with my chaotic brain so that one day I may
actually be half way clear.

I meant, all those Spans came from one query - so you got your bag of
payloads, right? If each Span was a separate entity, it would obviously
be way wrong - but from a single SpanQuery, at least you got all the
payloads in some form :)

I'd love to be able to give some more intelligent responses here, but
I'd have to dig back into the code again first. Spans were hard enough
to deal with without adding these payloads to the mix :)
> Not sure what you mean here?
> Mike

- Mark
