lucene-dev mailing list archives

From Shai Erera <>
Subject Re: SpanQuery and Spans optimizations
Date Thu, 06 Aug 2009 21:06:16 GMT
With ScoreDocs, though, we reuse the same instance. So I guess we'd like to
do the same here.

Seems like providing a TopSpansCollector is what you want, only unlike
TopFieldCollector which populates the fields post search, you'd like to do
it during search.
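
A rough sketch of what collecting spans during the search could look like,
in plain Java. TopSpansSketch, SpanHit and collect() are made-up names for
illustration only, not Lucene classes, and spans are simplified to
(start, end) int pairs with no payload. It is only meant to show the shape of
the idea: a bounded priority queue whose entries carry their spans.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch, not a Lucene class: keeps the top-N hits together
 *  with the spans seen for each hit while the search is running. */
final class TopSpansSketch {

    /** One hit: doc id, score, and the (start, end) spans matched in that doc. */
    static final class SpanHit {
        final int doc;
        final float score;
        final List<int[]> spans;

        SpanHit(int doc, float score, List<int[]> spans) {
            this.doc = doc;
            this.score = score;
            this.spans = spans;
        }
    }

    private final int maxHits;
    // Min-heap on score: the weakest retained hit sits at the head.
    private final PriorityQueue<SpanHit> pq;

    TopSpansSketch(int maxHits) {
        this.maxHits = maxHits;
        this.pq = new PriorityQueue<>(maxHits,
                Comparator.comparingDouble((SpanHit h) -> h.score));
    }

    /** Called once per matching doc, with the spans gathered while scoring it. */
    void collect(int doc, float score, List<int[]> spans) {
        if (pq.size() < maxHits) {
            pq.add(new SpanHit(doc, score, spans));
        } else if (score > pq.peek().score) {
            pq.poll(); // the evicted hit's spans become garbage here
            pq.add(new SpanHit(doc, score, spans));
        }
    }

    /** Retained hits, best score first. */
    List<SpanHit> topHits() {
        List<SpanHit> out = new ArrayList<>(pq);
        out.sort((a, b) -> Float.compare(b.score, a.score));
        return out;
    }
}
```

Any hit that is later pushed out of the queue takes its span list with it,
which is the wasted work discussed in the quoted thread.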

I've been typing and deleting suggestions for the past 5 minutes. I guess
it's late for me, so I'll sleep on it. sorry :)


On Thu, Aug 6, 2009 at 11:39 PM, Grant Ingersoll <> wrote:

> On Aug 6, 2009, at 4:25 PM, Shai Erera wrote:
>
>> But still you might collect spans for docs unnecessarily during
>> processing. If a doc is added to the PQ and later removed, then the spans
>> collection was just a waste of time (unless the collection comes in free
>> during query processing).
> Sure, but that is just the nature of the PQ: things get moved off. We
> already collect ScoreDocs that later get removed, too. We presumably are
> only storing a few more bytes: start (int), end (int) and payload (byte
> array, presumably small).
>> Also, if you build a paging search UI, then as soon as the user clicks
>> "Next" for the first time, you'll collect the Spans for the first 10 docs
>> (10 is an example) unnecessarily, because they won't be used.
> Again, likewise for the ScoreDocs.
>> I don't know if it makes sense, but how about if you execute the query and
>> get the top docs. Then you get the range of docs you need (first 10, second
>> 10). Then you sort the docs based on their appearance in the spans. Then
>> iterate on spans to collect them. You can use just skipTo. You can then
>> either sort back, or if you optimize it, just return the docs in the TopDocs
>> in the order they appeared, but now w/ the spans. I'm sure you get the idea
>> of what I propose, even though I use too many words to describe it :).
> Yes, this is what I do, but it involves jumping through hoops, etc. when it
> seems like we already had the info during scoring. Again, I am likely
> willing to trade off the memory and some extra garbage (but not much, I
> suspect) for not having to go through the Spans again.
> You can also somewhat optimize the iterate-over-ScoreDocs case by checking
> whether Spans.doc() is greater than ScoreDoc.doc. If it is, then you
> reset the Spans back to the beginning and do a skipTo. Not sure if
> this is faster than the sorting approach.
> -Grant
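
For reference, the sort-then-skipTo pass described in the quoted text could
be sketched roughly as below. SpansLike is a stand-in interface that only
mirrors the next()/skipTo()/doc()/start()/end() methods of Lucene's Spans;
ArraySpans and SpanPager are hypothetical helpers so the example runs without
an index. It assumes skipTo always advances at least one match before
testing the target, and that the doc ids for the page are unique.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for the subset of Lucene's Spans used here (an assumption). */
interface SpansLike {
    boolean next();
    boolean skipTo(int target);
    int doc();
    int start();
    int end();
}

/** Hypothetical in-memory spans: rows of {doc, start, end}, sorted by doc. */
final class ArraySpans implements SpansLike {
    private final int[][] rows;
    private int pos = -1;

    ArraySpans(int[][] rows) { this.rows = rows; }

    public boolean next() {
        if (pos < rows.length) pos++;
        return pos < rows.length;
    }

    // Always moves forward at least once, then stops at the first doc >= target.
    public boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (doc() < target);
        return true;
    }

    public int doc()   { return rows[pos][0]; }
    public int start() { return rows[pos][1]; }
    public int end()   { return rows[pos][2]; }
}

final class SpanPager {
    /** Collects (start, end) pairs for each doc on the current result page. */
    static Map<Integer, List<int[]>> collect(SpansLike spans, int[] pageDocs) {
        int[] sorted = pageDocs.clone();
        Arrays.sort(sorted);                 // visit docs in index order
        Map<Integer, List<int[]>> byDoc = new HashMap<>();
        boolean more = spans.next();         // position on the first span
        for (int target : sorted) {
            // Only skip when we are still behind the target; skipping while
            // already on it would lose the current match.
            if (more && spans.doc() < target) more = spans.skipTo(target);
            List<int[]> found = new ArrayList<>();
            while (more && spans.doc() == target) {
                found.add(new int[] {spans.start(), spans.end()});
                more = spans.next();
            }
            byDoc.put(target, found);
        }
        return byDoc;
    }
}
```

Because the result is keyed by doc id, the caller can walk the page in its
original rank order and look up each doc's spans, so no second sort back is
needed.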
