lucene-dev mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: SpanNearQuery's spans & payloads
Date Fri, 11 Sep 2009 19:34:08 GMT
I'd have to dig in to be of much help. Hard to remember this stuff.

0:a 1:a 1:b 2:c 2:d 3:e 3:a 4:f 4:g 5:h 5:i 6:j 6:a 7:b 7:k 8:k

 span 0 to 8
 span 1 to 8
 span 3 to 8
 span 6 to 8

I think those are the right 4. You start on the left and work right. Each span always starts
after the position where the previous one started.

So first you would find 0 to 8. After 0, 1 to 8.
After 1, 3 to 8, and after 3, 6 to 8. That makes sense.
You never see an end of 9 (the "k" at position 8) because the 8 comes first,
and you can end as many times on a pos as you want, but you don't
ever start two spans at the same pos. So I think this is right.
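
The rule described above can be sketched as a toy model. This is only an illustration of the reasoning in this message, not Lucene's actual span code: the function name `enumerate_spans` is hypothetical, slop and term ordering are ignored, and each "a" start is simply paired with the first "k" that follows it, with the end reported as position + 1.

```python
# Toy model of the enumeration rule described in this message.
# Assumption: this is NOT Lucene's NearSpans implementation; slop is ignored.

def enumerate_spans(a_positions, k_positions):
    """Pair each "a" start with the first later "k"; end is exclusive (pos + 1)."""
    spans = []
    for start in sorted(a_positions):      # each start position is used once
        later = [p for p in sorted(k_positions) if p > start]
        if later:
            spans.append((start, later[0] + 1))  # nearest end wins
    return spans

# Positions taken from the test document quoted below.
A_POSITIONS = [0, 1, 3, 6]
K_POSITIONS = [7, 8]

SPANS = enumerate_spans(A_POSITIONS, K_POSITIONS)
for start, end in SPANS:
    print(f"span {start} to {end}")
```

Under these assumptions the model yields the same four spans listed above, and an end of 9 never appears because the "k" at position 7 is always reached first.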

The second question I am less sure about without looking at code.
I think it's because each payload can only be loaded once. So the first
time you hit 0 to 8, you get both payloads, but for every other span that
hits 8, that payload was already loaded? So you get all of the payloads
you should, you're just not getting duplicates in each span. I'd have to think
harder about it, but overall it appears right ... ? All the Spans
are subspans of a larger Span, right?
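
A minimal sketch of that guess, assuming a payload really can be handed out only once per position. The class `OneShotPayloads`, the function `payloads_per_span`, and the payload labels such as `a@0` are all hypothetical; this does not reflect how Lucene actually stores or reads payloads, and it only reproduces the shape of the trunk output (two payloads on the first span, one on each later span), not the exact payload strings.

```python
# Toy model of the "each payload can only be loaded once" hypothesis.
# Assumption: names and payload labels are made up for illustration.

class OneShotPayloads:
    """Hands out the payload for a position at most once."""

    def __init__(self, payloads):
        self._payloads = dict(payloads)
        self._loaded = set()

    def load(self, pos):
        # Return None if the payload is missing or was already consumed.
        if pos in self._loaded or pos not in self._payloads:
            return None
        self._loaded.add(pos)
        return self._payloads[pos]


def payloads_per_span(spans, store):
    """Collect payloads at each span's start and last position (end - 1)."""
    result = []
    for start, end in spans:
        found = []
        for pos in (start, end - 1):
            payload = store.load(pos)
            if payload is not None:
                found.append(payload)
        result.append(((start, end), found))
    return result


# The four spans from this thread; "k" at position 7 closes all of them.
SPANS = [(0, 8), (1, 8), (3, 8), (6, 8)]
STORE = OneShotPayloads({0: "a@0", 1: "a@1", 3: "a@3", 6: "a@6", 7: "k@7"})

RESULT = payloads_per_span(SPANS, STORE)
for span, found in RESULT:
    print(span, found)
```

In this model only the first span sees the payload for "k", because the later spans ask for position 7 after it has already been consumed.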

- Mark



Michael McCandless wrote:
> Under LUCENE-1458, I'm hitting a curious test failure in
> TestPositionsIncrement.testPayloadsPos0.  The failure happens because
> the codec I'm testing (pulsing codec) allows you to retrieve the same
> payload more than once if the term was pulsed (inlined into terms
> dict), whereas w/ trunk you can only retrieve the payload once.
>
> But in debugging the failure, I'm struggling with what the correct
> behavior of SpanNearQuery really should be.
>
> The test creates a single doc with one analyzed field, with these
> single letter position:tokens:
>
>    0:a 1:a 1:b 2:c 2:d 3:e 3:a 4:f 4:g 5:h 5:i 6:j 6:a 7:b 7:k 8:k
>
> every token has a payload.
>
> Then it makes:
>
>   SpanNearQuery
>     SpanTermQuery term=a
>     SpanTermQuery term=k
>
> Term "a" occurs four times (positions 0, 1, 3, 6) and "k" occurs 2
> times (positions 7, 8).
>
> My first question is: what spans is SpanNearQuery supposed to
> enumerate?  Right now trunk does these four:
>
>    span 0 to 8
>    span 1 to 8
>    span 3 to 8
>    span 6 to 8
>
> which represents position 7 of "k" mated with all positions of "a".
> (remember end is 1+, so "k"'s position 7 turned into 8).  How come the
> position 8 occurrence of "k" was not included in any spans?
>
> My second question is: when you call getPayload() on each span, what
> should you get?  Right now trunk does this:
>
>     span 0 to 8
>       payload: pos: 0
>       payload: pos: 7
>     span 1 to 8
>       payload: pos: 0
>     span 3 to 8
>       payload: pos: 3
>     span 6 to 8
>       payload: pos: 6
>
> The first span properly includes the payload for "a" (pos: 0) and for
> "k" (pos: 7), but the the subsequent three do not include the payload
> for "k".  Shouldn't you get all payloads associated w/ the span?
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>   


-- 
- Mark

http://www.lucidimagination.com



