Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Received-SPF: pass (asf.osuosl.org: local policy)
From: Paul Elschot <paul.elschot@xs4all.nl>
To: java-dev@lucene.apache.org
Subject: Re: Flexible index format / Payloads Cont'd
Date: Wed, 5 Jul 2006 10:09:53 +0200
User-Agent: KMail/1.8.2
References: <44A444A2.20003@gmail.com>
 <7661FA4F-18B0-4AC4-A506-D049CED50AC5@rectangular.com>
 <20060704215144.GA17439@fermat.math.technion.ac.il>
In-Reply-To: <20060704215144.GA17439@fermat.math.technion.ac.il>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200607051009.54136.paul.elschot@xs4all.nl>

On Tuesday 04 July 2006 23:51, Nadav Har'El wrote:
...
> The problem is that Scorer, and its implementations - BooleanScorer2,
> DisjunctionSumScorer and ConjunctionScorer - only work on the document
> level. Scorer has next() and skipTo(), but no way to view positions
> inside the document. If you look at the lowest level Scorer, TermScorer,
> it uses TermDocs and not TermPositions.
> So I couldn't figure out a way to hack on BooleanScorer2 to change the
> score by positions.
> 
> Ok, then, I thought to myself - the normal queries and scorers only work
> on the document level and don't use positions - but SpanQueries have positions
> so I can create some sort of ProximityBooleanSpanQuery, right? Well,
> unfortunately, I couldn't figure out how, yet. SpanScorer is a Scorer
> as usual, and still doesn't have access to the positions. It does keep
> "spans", and gives a score according to their lengths, but I couldn't
> figure out how I could use this facility to do what we want.

SpanQueries can be nested because they pass around
Spans to higher levels for scoring at the top level of the proximity.
At the bottom level there is SpanTermQuery, which uses the positions
in the following way to create its Spans:

        public int doc() { return doc; }
        public int start() { return position; }
        public int end() { return position + 1; }

For the index format, the most interesting thing is what is not present
here: a weight per position.
Also, there is some redundancy in start() and end() here, but this is the
price of allowing nesting of SpanQueries.
All other SpanQueries combine these into other Spans, normally
with more distance between start() and end(). They also filter out
the Spans that do not match the query, as SpanNearQuery does, for example.
At the top level of the proximity query, a Spans is scored by SpanScorer
to give a score value per document.
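
The flow from term positions up to a per-document score can be sketched
in a few lines of self-contained Java. This is not Lucene's code:
MiniSpans, TermSpans and SpanScoring are hypothetical names, and the
"sum of 1/length per span" rule is only an assumption standing in for
whatever SpanScorer actually does with the span lengths.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, stripped-down stand-in for a Spans enumeration
// (not Lucene's interface).
interface MiniSpans {
    boolean next();   // advance to the next span; false when exhausted
    int doc();        // document of the current span
    int start();      // first position of the current span
    int end();        // one past the last position of the current span
}

// Bottom level: one span of length 1 per term occurrence.
class TermSpans implements MiniSpans {
    private final int[] docs;       // docs[i]: document of the i-th occurrence
    private final int[] positions;  // positions[i]: its position in that document
    private int i = -1;

    TermSpans(int[] docs, int[] positions) {
        this.docs = docs;
        this.positions = positions;
    }

    public boolean next() { return ++i < docs.length; }
    public int doc() { return docs[i]; }
    public int start() { return positions[i]; }
    public int end() { return positions[i] + 1; }
}

class SpanScoring {
    // Top level: collapse all spans of a document into one score.
    // Assumed rule: each span contributes 1 / (end - start).
    static Map<Integer, Float> scorePerDoc(MiniSpans spans) {
        Map<Integer, Float> scores = new LinkedHashMap<>();
        while (spans.next()) {
            int length = spans.end() - spans.start();
            scores.merge(spans.doc(), 1.0f / length, Float::sum);
        }
        return scores;
    }
}
```

Under this assumed rule, a term occurring at positions 3 and 7 of
document 0 and at position 1 of document 2 gives document 0 a score of
2.0 and document 2 a score of 1.0. Note that nothing in MiniSpans
carries a weight per position.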

So a minimal form of "ProximityBooleanSpanQuery" is already there
in Lucene. It is implemented by using a SpanScorer as a subscorer
of a BooleanScorer2, and by having this SpanScorer use the proximity
information passed up from the bottom level SpanTermQueries, normally
via some other SpanQuery like SpanNearQuery.

It might be possible to subclass Scorer to incorporate more position info,
but SpanQueries take a slightly different approach: they use Spans to pass
the position info around.
This is also the reason why Lucene has some difficulty in weighting
the subqueries of a SpanQuery: unlike a Scorer, a Spans does not have
a score or weight value, and SpanScorer is used to provide the score, but
only at the top level of the proximity structure.
This could be changed by adding a weight to Spans, or by adding some
form of position info to (a subclass of) Scorer.
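
To picture the first option, here is a rough sketch of a span that
carries a weight next to its positions. It is purely illustrative:
WeightedSpan, the near() helper, the product rule for combining
weights and the weight/length score are all assumptions, not an
existing or proposed Lucene API.

```java
// Hypothetical value class: a span that also carries a weight.
final class WeightedSpan {
    final int doc;
    final int start;
    final int end;       // one past the last position
    final float weight;

    WeightedSpan(int doc, int start, int end, float weight) {
        this.doc = doc;
        this.start = start;
        this.end = end;
        this.weight = weight;
    }

    // Ordered "near" combination: b must start at or after the end of a,
    // with at most `slop` positions in between.
    // Returns null when the two spans do not match.
    static WeightedSpan near(WeightedSpan a, WeightedSpan b, int slop) {
        if (a.doc != b.doc || b.start < a.end || b.start - a.end > slop) {
            return null;
        }
        // Assumed rule: the combined weight is the product of the sub-weights.
        return new WeightedSpan(a.doc, a.start, b.end, a.weight * b.weight);
    }

    // Assumed scoring rule: weight divided by span length.
    float score() {
        return weight / (end - start);
    }
}
```

With a weight available at every level, a subquery could influence the
score without waiting for the top level of the proximity structure.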

> 
> Lastly, I looked at what LoosePhraseScorer looks like, to understand how
> phrases do get to use positions. It appears that this scorer gets initialized
> with the TermPositions of each term, which includes the positions. This
> is great, but it means that a phrase can only contain terms (words) -
> LoosePhraseScorer could not handle more complex sub-queries, and their

PhraseScorers cannot be nested because they do not provide a Spans.
However, they might be extended to provide a Spans, and this would be
somewhat more efficient because the redundancy in start() and end() of
the Spans of the SpanTermQueries would be avoided.

> own Scorers. But it would have been nice if the proximity-enhanced boolean
> query could support not just term sub-queries.

How would you like the proximity information for nested proximity queries to
be passed around for scoring?
Using Spans is one way, but there are more, especially when a weight
per position becomes available.

Regards,
Paul Elschot
