lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: SpanQuery and Spans optimizations
Date Thu, 06 Aug 2009 20:25:10 GMT
But you might still collect spans for docs unnecessarily during processing.
If a doc is added to the PQ and later removed, then collecting its spans was
just a waste of time (unless the collection comes for free during query
processing).

Also, if you build a paging search UI, then as soon as the user clicks
"Next" for the first time, you'll collect the Spans for the first 10 docs
(10 is an example) unnecessarily, because only the second 10 are displayed.

I don't know if it makes sense, but how about this: execute the query and
get the top docs, then take the range of docs you need (first 10, second
10). Sort those docs by doc ID, which is the order in which they appear in
the spans, and iterate over the spans to collect them. You can use just
skipTo to move from one doc to the next. You can then either sort back by
rank or, if you optimize it, leave the docs in the TopDocs in the order they
appeared and just attach the spans to them. I'm sure you get the idea of
what I propose, even though I use too many words to describe it :).
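
To make the idea concrete, here is a rough sketch of the second pass. It does
not use the real Lucene classes: SpanCursor is a hypothetical stand-in modeled
on the next/skipTo/doc/start/end methods of Spans, and ArraySpanCursor and
collect are names made up for illustration only. It assumes the spans are
visited in increasing doc ID order and that skipTo always advances at least
one match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SpansForPage {

    /** Hypothetical stand-in for a Spans-like iterator (not the Lucene class). */
    interface SpanCursor {
        boolean next();              // advance to the next match
        boolean skipTo(int target);  // advance to the first later match with doc >= target
        int doc();
        int start();
        int end();
    }

    /** In-memory cursor over {doc, start, end} rows sorted by doc, for illustration. */
    static final class ArraySpanCursor implements SpanCursor {
        private final int[][] rows;
        private int pos = -1;

        ArraySpanCursor(int[][] rows) { this.rows = rows; }

        public boolean next() { return ++pos < rows.length; }

        public boolean skipTo(int target) {
            // Always moves forward at least once, then keeps going until doc >= target.
            do {
                if (!next()) return false;
            } while (doc() < target);
            return true;
        }

        public int doc()   { return rows[pos][0]; }
        public int start() { return rows[pos][1]; }
        public int end()   { return rows[pos][2]; }
    }

    /** Collects {start, end} pairs for only the docs on the requested page. */
    static Map<Integer, List<int[]>> collect(int[] pageDocs, SpanCursor spans) {
        int[] sorted = pageDocs.clone();
        Arrays.sort(sorted);                      // doc ID order matches span order

        Map<Integer, List<int[]>> byDoc = new HashMap<>();
        boolean more = spans.next();
        for (int target : sorted) {
            if (more && spans.doc() < target) {
                more = spans.skipTo(target);      // jump over docs that are not on the page
            }
            if (!more) break;
            List<int[]> hits = new ArrayList<>();
            while (more && spans.doc() == target) {
                hits.add(new int[] { spans.start(), spans.end() });
                more = spans.next();
            }
            if (!hits.isEmpty()) byDoc.put(target, hits);
        }
        return byDoc;
    }
}
```

Because the result is keyed by doc ID, the caller can walk the page in its
original rank order and look up each doc's spans, so no explicit "sort back"
step is needed.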

Shai

On Thu, Aug 6, 2009 at 9:40 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> On Aug 6, 2009, at 2:31 PM, Paul Elschot wrote:
>
> > With a single search one might end up collecting lots of span info
> > that will be thrown away because the document score is too low.
>
> Presumably, you would only collect it if the result was actually put onto
> the PriorityQueue, in other words, after scoring that particular doc, so you
> would only be keeping Span values for the number of results requested. I'd
> be willing to trade off that memory, I think, versus having to go
> iterate/skip all over Spans again.
>
> > So I think the best way is to first collect the best hits in the usual
> > way, and then get the spans of the query (effectively once more,
> > but now without SpanScorer in between) with the doc numbers
> > of the best hits as a filter while collecting all the begin/end positions.
>
> Yes, that is what I've traditionally done, but it is convoluted to
> associate it with a ranked list of docs.
