lucene-dev mailing list archives

From Grant Ingersoll <>
Subject Re: SpanQuery and Spans optimizations
Date Thu, 06 Aug 2009 20:39:14 GMT

On Aug 6, 2009, at 4:25 PM, Shai Erera wrote:

> But still you might collect spans for docs unnecessarily during  
> processing. If a doc is added to the PQ and later removed, then the  
> spans collection was just a waste of time (unless the collection  
> comes in free during query processing).

Sure, but that is just the nature of the PQ: things get moved off. We
already collect ScoreDocs that later get removed, too. We would
presumably be storing only a few more bytes per entry: start (int),
end (int), and payload (byte array, presumably small).

> Also, if you build a paging search UI, then as soon as the user  
> clicks "Next" for the first time, you'll collect the Spans for the  
> first 10 docs (10 is an example) unnecessarily, because they won't  
> be used.

Again, likewise for the ScoreDocs.

> I don't know if it makes sense, but how about if you execute the  
> query and get the top docs. Then you get the range of docs you need  
> (first 10, second 10). Then you sort the docs based on their  
> appearance in the spans. Then iterate on spans to collect them. You  
> can use just skipTo. You can then either sort back, or if you  
> optimize it, just return the docs in the TopDocs in the order they  
> appeared, but now w/ the spans. I'm sure you get the idea of what I  
> propose, even though I use too many words to describe it :).

Yes, this is what I do, but it involves jumping through hoops when it
seems like we already had the info during scoring. Again, I am likely
willing to accept the extra memory and some extra garbage (but not
much, I suspect) to avoid having to go through the Spans again.
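
A minimal sketch of the sort-then-skipTo approach discussed above, written against a hypothetical `SimpleSpans` interface rather than Lucene itself. The interface only mirrors the method names `next()`, `skipTo(int)`, `doc()`, `start()` and `end()`; `ArraySpans` and `SpanPageCollector` are illustrative names, and payloads are left out. Results are keyed by doc id, so the caller can keep the page in score order and no "sort back" step is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the parts of a Spans enumeration used here.
interface SimpleSpans {
    boolean next();
    boolean skipTo(int target);
    int doc();
    int start();
    int end();
}

// In-memory spans over rows of {doc, start, end}, sorted by doc. Demo only.
class ArraySpans implements SimpleSpans {
    private final int[][] rows;
    private int pos = -1;

    ArraySpans(int[][] rows) { this.rows = rows; }

    public boolean next() {
        if (pos < rows.length) pos++;
        return pos < rows.length;
    }

    // Assumed contract: always advance at least once, then stop at the
    // first span whose doc is >= target.
    public boolean skipTo(int target) {
        if (!next()) return false;
        while (rows[pos][0] < target) {
            if (!next()) return false;
        }
        return true;
    }

    public int doc() { return rows[pos][0]; }
    public int start() { return rows[pos][1]; }
    public int end() { return rows[pos][2]; }
}

class SpanPageCollector {
    // Returns docId -> list of {start, end} for every doc on the page.
    static Map<Integer, List<int[]>> collect(SimpleSpans spans, int[] pageDocs) {
        int[] sorted = pageDocs.clone();
        Arrays.sort(sorted);                       // visit docs in index order
        Map<Integer, List<int[]>> byDoc = new HashMap<>();
        boolean more = spans.next();
        for (int doc : sorted) {
            if (byDoc.containsKey(doc)) continue;  // ignore duplicate doc ids
            List<int[]> hits = new ArrayList<>();
            byDoc.put(doc, hits);
            // Only skip when we are behind; skipTo always moves forward.
            if (more && spans.doc() < doc) more = spans.skipTo(doc);
            while (more && spans.doc() == doc) {
                hits.add(new int[] {spans.start(), spans.end()});
                more = spans.next();
            }
        }
        return byDoc;
    }
}
```

Because the page is visited in increasing doc order, the spans are walked forward exactly once per page.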

You can also somewhat optimize the iterate-over-ScoreDocs case by
checking whether Spans.doc() is greater than ScoreDoc.doc. If it is,
you reset the Spans back to the beginning and do a skipTo. Not sure
if this is faster than the sorting approach.
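
A rough sketch of that reset variant, again against a hypothetical interface rather than Lucene itself. `DocSpans`, `RowSpans` and `ResettingSpanCollector` are illustrative names, and the `Supplier` stands in for whatever re-creates the enumeration (in Lucene this would presumably mean asking the SpanQuery for its Spans again). An exhausted enumeration is treated the same as one that has moved past the wanted doc.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the parts of a Spans enumeration used here.
interface DocSpans {
    boolean next();
    boolean skipTo(int target);
    int doc();
    int start();
    int end();
}

// In-memory spans over rows of {doc, start, end}, sorted by doc. Demo only.
class RowSpans implements DocSpans {
    private final int[][] rows;
    private int pos = -1;

    RowSpans(int[][] rows) { this.rows = rows; }

    public boolean next() {
        if (pos < rows.length) pos++;
        return pos < rows.length;
    }

    // Assumed contract: advance at least once, stop at first doc >= target.
    public boolean skipTo(int target) {
        if (!next()) return false;
        while (rows[pos][0] < target) {
            if (!next()) return false;
        }
        return true;
    }

    public int doc() { return rows[pos][0]; }
    public int start() { return rows[pos][1]; }
    public int end() { return rows[pos][2]; }
}

class ResettingSpanCollector {
    // Returns one list of {start, end} per doc, in the order the docs were given.
    static List<List<int[]>> collect(Supplier<DocSpans> fresh, int[] docsInScoreOrder) {
        List<List<int[]>> out = new ArrayList<>();
        DocSpans spans = fresh.get();
        boolean more = spans.next();
        for (int doc : docsInScoreOrder) {
            if (!more || spans.doc() > doc) {
                // Already past the wanted doc (or exhausted): start over.
                spans = fresh.get();
                more = spans.skipTo(doc);
            } else if (spans.doc() < doc) {
                // Still behind: a forward skip is enough, no reset needed.
                more = spans.skipTo(doc);
            }
            List<int[]> hits = new ArrayList<>();
            while (more && spans.doc() == doc) {
                hits.add(new int[] {spans.start(), spans.end()});
                more = spans.next();
            }
            out.add(hits);
        }
        return out;
    }
}
```

How often the reset fires depends on how far score order diverges from doc order, which is why it is unclear whether this beats sorting the page first.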

