From: Mark Miller
Date: Sat, 12 Sep 2009 12:22:53 -0400
To: java-dev@lucene.apache.org
Subject: Re: SpanNearQuery's spans & payloads

Paul Elschot wrote:
> On Saturday 12 September 2009 14:40:28 Mark Miller wrote:
> > Michael McCandless wrote:
> > > OK thanks for the responses. This is indeed tricky stuff!
> > >
> > > On Sat, Sep 12, 2009 at 12:28 AM, Mark Miller wrote:
> > >
> > >> They start at the left and march right - each Span always starting
> > >> after the last started,
> > >>
> > >
> > > That's not quite always true -- eg I got span 1-8, twice, once I added
> > > "b" as a clause to the SNQ.
> > >
> > Mmm - right - depends on how you look at it, I think. It is less simple
> > with terms at multiple positions, in that now each Span doesn't start
> > in the *position* after the last - but if you line up the terms like you
> > did, it's still the same - the first 1 - 8 starts at the first term at
> > pos 1, and the next 1 to 8 starts at the second term at pos 1. One starts
> > after the other (though if you think in Lucene positions, I realize they
> > virtually start at the same spot).
> >
> > >> You might want exhaustive for highlighting as well - but it's
> > >> different algorithms ...
> > >>
> > >
> > > Yeah, how we would represent spans for highlighting is tricky... we
> > > had discussed this ("how to represent spans for aggregate queries")
> > > recently, I think under LUCENE-1522.
> > >
> > > I think we'd have to return a tree structure, that mirrors the query's
> > > tree structure, to hold the spans, rather than try to enumerate
> > > ("denormalize") all possible expansions. Each leaf node would hold
> > > actual data (position, term, payload, etc.), and then the tree nodes
> > > would express how they are and'd/or'd/near'd together. My app could then
> > > walk the tree to compute any combination I wanted.
> > >
> > >> In the end, I accepted my definition of works as - when I ask for
> > >> the payloads back, will I end up with a bag of all the payloads that
> > >> the Spans touched. I think you do.
> > >>
> > >
> > > Yeah I think you do, except each payload is only returned once. So
> > > it's only the first span that hits a payload that will return it.
> > >
> > > So it sounds like SNQ just isn't guaranteed to be exhaustive in how it
> > > enumerates the spans, eg I'll never see that 2nd occurrence of "k",
> > > nor its associated payload.
> > >
> > Not only not guaranteed, but it's just not going to happen - it's not
> > how spans match.
> > If I say find n within 300 of m with the following:
> >
> > n m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m
> > m m m m m m m m m m m m
> >
> > Only the first m will match. It will start at the left, find the n, then
> > say great, an m within 300, this doc matches, we are done. There is
> > not another n to start on or finish on to the right. It doesn't then
> > touch the next 300 m's - just the way Doug implemented them, from what I
> > can tell. It's only exhaustive from the left - find m within 300 of n,
> > order matters (m first):
> >
> > m m m m m m m m m m m m m m m m m m n
> >
> > This will be a bunch of spans - start at the left - the first m to n
> > matches, then the second m to n matches, then the third m to n matches,
> > and so on as we move right.
> >
> In the ordered case that last one should only match once, against
> the last m.
>
> Regards,
> Paul Elschot

Good point - too lazy with my examples - shouldn't have said order
matters :)

The ordered NearSpan does appear to drop to the min from the left. It
shrinks down to the shortest match, which is part of what makes it so hard
to lazy load the payloads: you don't know that a start point is not a match
until it has already moved on, and then it might find a shorter one - in
which case you have to dump the payload from the previous ... and so on.
You can constantly be loading payloads that don't end up matching (though
I think the unordered would consider them matches - even if they just
happened to come in order).

Unordered does not attempt to shrink the match like this and works as I
said (I think - Paul's the Spans wizard). Ordered, I think, works on the
same principle but will attempt to shrink to the smallest satisfying Span?

--
- Mark

http://www.lucidimagination.com
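
For anyone who wants to play with the two examples above, here is a toy
model of the behaviour described in this thread. It is not Lucene code:
SpanToy, unordered, ordered and range are hypothetical names, the position
lists are made up, and the advance/shrink rules are assumptions taken only
from the descriptions above (unordered: report the current pair, then
advance whichever term is leftmost; ordered: shrink to the closest start
before reporting). It handles exactly two terms and ignores payloads.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy two-term near matcher over sorted position lists; NOT Lucene's implementation. */
final class SpanToy {

    /** Unordered: report the current pair if it fits the slop, then advance the leftmost term. */
    static List<int[]> unordered(int[] a, int[] b, int slop) {
        List<int[]> spans = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            int start = Math.min(a[i], b[j]);
            int last = Math.max(a[i], b[j]);
            // Positions strictly between the two terms count against the slop.
            if (last - start - 1 <= slop) {
                spans.add(new int[] {start, last + 1}); // end is exclusive
            }
            // Only the leftmost term moves; the other pointer stays put.
            if (a[i] <= b[j]) i++; else j++;
        }
        return spans;
    }

    /** Ordered (first must precede second): shrink to the closest start, then report. */
    static List<int[]> ordered(int[] first, int[] second, int slop) {
        List<int[]> spans = new ArrayList<>();
        int i = 0, j = 0;
        while (i < first.length) {
            // Move the second term past the current first-term position.
            while (j < second.length && second[j] <= first[i]) j++;
            if (j == second.length) break;
            // Shrink: skip starts that a later start, still before second[j], would replace.
            while (i + 1 < first.length && first[i + 1] < second[j]) i++;
            if (second[j] - first[i] - 1 <= slop) {
                spans.add(new int[] {first[i], second[j] + 1});
            }
            i++;
        }
        return spans;
    }

    /** Positions from..toExclusive-1, used to build runs of the same term. */
    static int[] range(int from, int toExclusive) {
        int[] r = new int[toExclusive - from];
        for (int k = 0; k < r.length; k++) r[k] = from + k;
        return r;
    }

    public static void main(String[] args) {
        // "n m m m ...": one n at position 0, then 40 m's (count chosen for illustration).
        System.out.println(unordered(new int[] {0}, range(1, 41), 300).size()); // prints 1
        // "m m m ... n": 18 m's at positions 0..17, then one n at position 18.
        System.out.println(unordered(range(0, 18), new int[] {18}, 300).size()); // prints 18
        System.out.println(ordered(range(0, 18), new int[] {18}, 300).size());   // prints 1
    }
}
```

In this model the unordered matcher stops as soon as the single n is used
up, so only the first m is ever reported in the first example, while the
second example yields one span per m. The ordered matcher reports a single
span starting at the last m, which is the shrinking Paul points out.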