lucene-dev mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: SpanNearQuery's spans & payloads
Date Sat, 12 Sep 2009 12:48:54 GMT
In other words, Spans is guaranteed to find a document *if* a set of
terms matches the positional constraints - if bush is within 20 of george,
it's guaranteed to find that - but it makes no attempt to find
every george within 20 of bush (though it may find multiple, or even
all of them, depending on how the text is set up and the query constraints).
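
To make that concrete, here is a toy model of the left-to-right behavior. It is a sketch only, not Lucene's code: the function name ordered_near_spans is made up, it assumes an ordered two-term query, and it treats "within N" as a plain position difference of at most N.

```python
def ordered_near_spans(tokens, first, second, max_distance):
    """Toy model: one span per occurrence of `first`, closed by the
    nearest later occurrence of `second` within max_distance positions."""
    spans = []
    for start, token in enumerate(tokens):
        if token != first:
            continue
        last = min(start + max_distance, len(tokens) - 1)
        # Look right for the nearest `second`; stop at the first hit.
        for end in range(start + 1, last + 1):
            if tokens[end] == second:
                spans.append((start, end))
                break  # later occurrences of `second` are never visited
    return spans


# One george, two bushes: the document matches, but the second bush
# (position 4) never shows up in any span.
skipped = ordered_near_spans(["george", "w", "bush", "and", "bush"],
                             "george", "bush", 20)
print(skipped)  # [(0, 2)]

# Several m's before a single n: every m starts its own span.
multiple = ordered_near_spans(["m", "m", "m", "n"], "m", "n", 300)
print(multiple)  # [(0, 3), (1, 3), (2, 3)]
```

The break is the whole point: each span is anchored on the left-hand term, so extra occurrences on the right are only reported if another left-hand term comes along to start a new span.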

Mark Miller wrote:
> Michael McCandless wrote:
>   
>> OK thanks for the responses.  This is indeed tricky stuff!
>>
>> On Sat, Sep 12, 2009 at 12:28 AM, Mark Miller <markrmiller@gmail.com> wrote:
>>
>>   
>>     
>>> They start at the left and march right - each Span always starting
>>> after the last started,
>>>     
>>>       
>> That's not quite always true -- eg I got span 1-8, twice, once I added
>> "b" as a clause to the SNQ.
>>   
>>     
> Mmm - right - depends on how you look at it I think - it is less simple
> with terms at multiple positions, in that now each Span doesn't start
> in the *position* after the last - but if you line up the terms like you
> did, it's still the same - the first 1 - 8 starts at the first term at
> pos 1, and the next 1 - 8 starts at the second term at pos 1. One starts
> after the other (though if you think in Lucene positions, I realize they
> virtually start at the same spot).
>   
>>   
>>     
>>> You might want exhaustive for highlighting as well - but it's
>>> different algorithms ...
>>>     
>>>       
>> Yeah, how we would represent spans for highlighting is tricky... we
>> had discussed this ("how to represent spans for aggregate queries")
>> recently, I think under LUCENE-1522.
>>
>> I think we'd have to return a tree structure, that mirrors the query's
>> tree structure, to hold the spans, rather than try to enumerate
>> ("denormalize") all possible expansions.  Each leaf node would hold
>> actual data (position, term, payload, etc.), and then the tree nodes
>> would express how they are and'd/or'd/near'd together.  My app could then
>> walk the tree to compute any combination I wanted.
>>
>>   
>>     
>>> In the end, I accepted my definition of "works" as - when I ask for
>>> the payloads back, will I end up with a bag of all the payloads that
>>> the Spans touched? I think you do.
>>>     
>>>       
>> Yeah I think you do, except each payload is only returned once.  So
>> it's only the first span that hits a payload that will return it.
>>
>> So it sounds like SNQ just isn't guaranteed to be exhaustive in how it
>> enumerates the spans, eg I'll never see that 2nd occurrence of "k",
>> nor its associated payload.
>>   
>>     
> Not only not guaranteed, but it's just not going to happen - it's not
> how spans match. If I say find n within 300 of m with the following:
>
> n m m m m m m m m m m m m  m m m m m m m m m m m m m m m m m m m m m m
> m  m m m m m m m m m m m
>
> Only the first m will match. It will start at the left, find the n, then
> say great, an m within 300, this doc matches, we are done. There is
> not another n to start on or finish on to the right. It doesn't then
> touch the next 300 m's - that's just the way Doug implemented them, from
> what I can tell. It's only exhaustive from the left - find m within 300
> of n, order matters (m first):
>
> m m m m m m m m m m m m m m m m m m n
>
> This will be a bunch of spans - start at the left - the first m to n
> matches, then the second m - n matches, then the third m to n matches,
> and so on as we move right.
>   
>> For now I'll just match this behavior ("can only load payload once")
>> in all codecs in LUCENE-1458... the test passes again once I do that.
>>
>>   
>>     
>>> I meant, all those Spans came from one query - so you got your bag
>>> of payloads right? If each Span was a separate entity, it would
>>> obviously be way wrong - but from a single SpanQuery, at least you
>>> got all the payloads in some form :)
>>>     
>>>       
>> Right, this is all one query... but the payload for the 2nd
>> occurrence of "k" was never included in any span so I didn't get "all"
>> payloads.
>>   
>>     
> You got all the payloads the query matched - I think you need a
> different query (or we change the Spans algorithm completely)
>   
>> Maybe if/once we incorporate spans into Lucene's normal queries
>> (optionally, so there's no performance hit if you don't ask for them)
>> we can re-visit these issues.
>>
>> Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>   
>>     
>
>
>   
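
For what it's worth, the tree-shaped result Mike describes in the quoted text above might look roughly like the following. This is only a sketch of the idea: the class names Leaf and Node, the field names, and the sample positions and payloads are all made up for illustration, and nothing like this exists in Lucene today.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Leaf:
    """Actual match data for one term occurrence."""
    term: str
    position: int
    payload: Optional[bytes] = None


@dataclass
class Node:
    """Inner node mirroring a query clause: "and", "or" or "near"."""
    op: str
    children: List[Union["Node", Leaf]] = field(default_factory=list)


def leaves(tree):
    """Walk the tree depth-first, left to right, yielding every leaf."""
    if isinstance(tree, Leaf):
        yield tree
    else:
        for child in tree.children:
            yield from leaves(child)


# Hypothetical match for: george near (bush at either of two positions)
match = Node("near", [
    Leaf("george", 1, b"p1"),
    Node("or", [Leaf("bush", 3, b"p2"), Leaf("bush", 7, b"p3")]),
])

payloads = [leaf.payload for leaf in leaves(match)]
positions = [leaf.position for leaf in leaves(match)]
print(payloads)   # [b'p1', b'p2', b'p3']
print(positions)  # [1, 3, 7]
```

Since both occurrences of bush stay in the tree, the app can pick whichever combinations it wants without every expansion having to be enumerated up front.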


-- 
- Mark

http://www.lucidimagination.com



