From: Paul Elschot
To: java-dev@lucene.apache.org
Subject: Re: Flexible index format / Payloads Cont'd
Date: Wed, 5 Jul 2006 10:09:53 +0200

On Tuesday 04 July 2006 23:51, Nadav Har'El wrote:
...
> The problem is that Scorer, and it's implementations - BooleanScorer2,
> DisjunctionSumScorer and ConjunctionScorer - only work on the document
> level. Scorer has next() and skipTo(), but no way to view positions
> inside the document. If you look at the lowest level Scorer, TermScorer,
> it uses TermDocs and not TermPositions.
> So I couldn't figure out a way to hack on BooleanScorer2 to change the
> score by positions.
>
> Ok, then, I thought to myself - the normal queries and scorers only work
> on the document level and don't use positions - but SpanQueries have positions
> so I can create some sort of ProximityBooleanSpanQuery, right? Well,
> unfortunately, I couldn't figure out how, yet. SpanScorer is a Scorer
> as usual, and still doesn't have access to the positions. It does keep
> "spans", and gives a score according to their lengths, but I couldn't
> figure out how I could use this facility to do what we want.

SpanQueries can be nested because they pass Spans up to the higher
levels, and scoring happens only at the top level of the proximity query.
At the bottom level there is SpanTermQuery, which uses
the positions in the following way to create its Spans:

    public int doc() { return doc; }
    public int start() { return position; }
    public int end() { return position + 1; }

For the index format, the most interesting thing is what is not present
here: a weight per position.
Also, there is some redundancy in start() and end() here, but this is
the price of allowing nesting of SpanQueries.

All other SpanQueries combine these into other Spans, normally with more
distance between start() and end(). They also filter out the Spans
that do not match the query, as SpanNearQuery does, for example.
At the top level of the proximity query, a Spans is scored by SpanScorer
to give a score value per document.
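
To make the bottom level concrete, the following is a minimal,
self-contained sketch rather than Lucene's actual code. MiniSpans,
TermSpansSketch and NearSpansSketch are hypothetical names, the postings
are hard-coded arrays, and the brute-force ordered match is only a
simplification of what a combining SpanQuery such as SpanNearQuery does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for Spans; not the real Lucene interface. */
interface MiniSpans {
    boolean next(); // advance to the next span, false when exhausted
    int doc();      // document of the current span
    int start();    // first position of the current span
    int end();      // one past the last position of the current span
}

/** Bottom level: one span of length 1 per {doc, position} posting of a term. */
class TermSpansSketch implements MiniSpans {
    private final int[][] postings; // each entry is {doc, position}
    private int i = -1;

    TermSpansSketch(int[][] postings) { this.postings = postings; }

    public boolean next() { return ++i < postings.length; }
    public int doc() { return postings[i][0]; }
    public int start() { return postings[i][1]; }
    public int end() { return postings[i][1] + 1; }
}

/** Higher level: combines two Spans into wider ones and drops non-matches. */
class NearSpansSketch {
    /** Drains a MiniSpans into a list of {doc, start, end} triples. */
    static List<int[]> collect(MiniSpans s) {
        List<int[]> out = new ArrayList<>();
        while (s.next()) {
            out.add(new int[] {s.doc(), s.start(), s.end()});
        }
        return out;
    }

    /** Keeps pairs in the same doc where b follows a with at most slop positions between. */
    static List<int[]> orderedNear(MiniSpans a, MiniSpans b, int slop) {
        List<int[]> left = collect(a);
        List<int[]> right = collect(b);
        List<int[]> out = new ArrayList<>();
        for (int[] x : left) {
            for (int[] y : right) {
                int gap = y[1] - x[2]; // positions between end of x and start of y
                if (x[0] == y[0] && gap >= 0 && gap <= slop) {
                    out.add(new int[] {x[0], x[1], y[2]}); // enclosing span
                }
            }
        }
        return out;
    }
}
```

Note that the combined spans carry only doc, start and end: nothing like
a weight travels upward with them.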

So a minimal form of "ProximityBooleanSpanQuery" is already there in Lucene.
It is implemented by using a SpanScorer as a subscorer of a BooleanScorer2,
and by having this SpanScorer use the proximity information passed up from
the bottom level SpanTermQueries, normally via some other SpanQuery
like SpanNearQuery.

It might be possible to subclass Scorer to incorporate more position info,
but SpanQueries have a slightly different take: they use Spans to pass the
position info around. This is also the reason why Lucene has some
difficulty in weighting the subqueries of a SpanQuery: unlike a Scorer,
a Spans does not have a score or weight value, and SpanScorer is used to
provide the score, but only at the top level of the proximity structure.

This could be changed by adding a weight to Spans, or by adding some form
of position info to (a subclass of) Scorer.

>
> Lastly, I looked at what LoosePhraseScorer looks like, to understand how
> phrases do get to use positions. It appears that this scorer gets initialized
> with the TermPositions of each term, which includes the positions. This
> is great, but it means that it a phrase can only contain terms (words) -
> LoosePhraseScorer could not handle more complex sub-queries, and their

PhraseScorers cannot be nested because they do not provide a Spans.
However, they might be extended to provide a Spans, and this would be
somewhat more efficient because the redundancy in start() and end() of
the Spans of the SpanTermQueries would be avoided.

> own Scorers. But it would have been nice if the proximity-enhanced boolean
> query could support not just term sub-queries.

How would you like the proximity information for nested proximity queries
to be passed around for scoring? Using Spans is one way, but there are
more, especially when a weight per position becomes available.
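
As one illustration of the first option, a span that carries a weight
could be scored per document roughly as below. This is only a sketch
under assumptions: WeightedSpanSketch and WeightedSpanScoring are
hypothetical names, no such weight exists in Spans today, and dividing
the weight by the span length is an arbitrary choice made here so that
shorter spans count more.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical span with a weight attached; not part of Lucene. */
final class WeightedSpanSketch {
    final int doc;
    final int start;
    final int end;      // one past the last position, so length is at least 1
    final float weight; // assumed to come from a per-position weight in the index

    WeightedSpanSketch(int doc, int start, int end, float weight) {
        this.doc = doc;
        this.start = start;
        this.end = end;
        this.weight = weight;
    }

    int length() { return end - start; }
}

class WeightedSpanScoring {
    /** Sums weight / length over the spans of each document. */
    static Map<Integer, Float> scorePerDoc(WeightedSpanSketch[] spans) {
        Map<Integer, Float> scores = new TreeMap<>();
        for (WeightedSpanSketch s : spans) {
            float contribution = s.weight / s.length();
            scores.merge(s.doc, contribution, Float::sum);
        }
        return scores;
    }
}
```

With a weight on each span, a nested query could adjust the contribution
of its subqueries before the top level turns the spans into a document
score.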

Regards,
Paul Elschot