lucene-dev mailing list archives

From Grant Ingersoll
Subject Re: SpanQuery and Spans optimizations
Date Thu, 06 Aug 2009 21:09:55 GMT

On Aug 6, 2009, at 5:06 PM, Shai Erera wrote:

> Only w/ ScoreDocs we reuse the same instance. So I guess we'd like  
> to do the same here.
> Seems like providing a TopSpansCollector is what you want, only  
> unlike TopFieldCollector which populates the fields post search,  
> you'd like to do it during search.

Bingo, but I think the collection functionality needs to be on
Collector, as I'd hate to lose out on functionality that the other
impls offer, or to have to recreate it.

> I've been typing and deleting suggestions for the past 5 minutes. I  
> guess it's late for me, so I'll sleep on it. sorry :)
> Shai
> On Thu, Aug 6, 2009 at 11:39 PM, Grant Ingersoll wrote:
> > On Aug 6, 2009, at 4:25 PM, Shai Erera wrote:
> > > But still you might collect spans for docs unnecessarily during
> > > processing. If a doc is added to the PQ and later removed, then the
> > > spans collection was just a waste of time (unless the collection
> > > comes in free during query processing).
> > Sure, but that is just the nature of the PQ: things get moved off.
> > We collect ScoreDocs right now that get removed, too. We
> > presumably are only storing a few more bytes: start (int), end
> > (int) and payload (byte array, presumably small).
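
The trade-off discussed above can be sketched as a bounded priority queue whose entries carry their spans. This is only an illustration under stated assumptions, not Lucene code: `TopHitsWithSpans`, `Hit`, and `SpanInfo` are hypothetical names, and `java.util.PriorityQueue` stands in for Lucene's own queue. The `discardedSpans` counter measures the wasted span collection that Shai describes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: keep the top-N hits together with spans gathered during scoring. */
final class TopHitsWithSpans {

    /** The "few more bytes" per match: start, end, and a (presumably small) payload. */
    static final class SpanInfo {
        final int start;
        final int end;
        final byte[] payload;

        SpanInfo(int start, int end, byte[] payload) {
            this.start = start;
            this.end = end;
            this.payload = payload;
        }
    }

    static final class Hit {
        final int doc;
        final float score;
        final List<SpanInfo> spans;

        Hit(int doc, float score, List<SpanInfo> spans) {
            this.doc = doc;
            this.score = score;
            this.spans = spans;
        }
    }

    private final int capacity;
    // Min-heap on score, so the weakest retained hit is always at the head.
    private final PriorityQueue<Hit> queue =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
    private int discardedSpans;

    TopHitsWithSpans(int capacity) {
        if (capacity < 1) throw new IllegalArgumentException("capacity must be >= 1");
        this.capacity = capacity;
    }

    void collect(int doc, float score, List<SpanInfo> spans) {
        if (queue.size() < capacity) {
            queue.add(new Hit(doc, score, spans));
            return;
        }
        Hit weakest = queue.peek();
        if (score > weakest.score) {
            queue.poll();                            // the weakest hit falls off the queue
            discardedSpans += weakest.spans.size();  // its spans were collected for nothing
            queue.add(new Hit(doc, score, spans));
        } else {
            discardedSpans += spans.size();          // this hit never enters the queue
        }
    }

    /** Retained hits, best score first. */
    List<Hit> topHits() {
        List<Hit> out = new ArrayList<>(queue);
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return out;
    }

    int discardedSpans() {
        return discardedSpans;
    }
}
```

Comparing `discardedSpans()` with the number of spans actually returned gives a rough measure of how much extra garbage this approach creates for a given query and queue size.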
> > > Also, if you build a paging search UI, then as soon as the user
> > > clicks "Next" for the first time, you'll collect the Spans for the
> > > first 10 docs (10 is an example) unnecessarily, because they won't
> > > be used.
> > Again, likewise for the ScoreDocs.
> > > I don't know if it makes sense, but how about if you execute the
> > > query and get the top docs. Then you get the range of docs you need
> > > (first 10, second 10). Then you sort the docs based on their
> > > appearance in the spans. Then iterate on spans to collect them. You
> > > can use just skipTo. You can then either sort back, or if you
> > > optimize it, just return the docs in the TopDocs in the order they
> > > appeared, but now w/ the spans. I'm sure you get the idea of what I
> > > propose, even though I use too many words to describe it :).
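
A minimal sketch of the sort-then-skipTo idea described above. To stay self-contained it does not use Lucene: `ArraySpans` is a hypothetical in-memory stand-in that mimics the `next()`, `skipTo(int)`, `doc()`, `start()`, and `end()` methods of Lucene's `Spans`, and `PagedSpanCollection.collect` is an illustrative name. The page of doc ids is assumed to arrive in score order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: gather spans for one page of hits in a single forward pass. */
final class PagedSpanCollection {

    /** In-memory stand-in for a Spans enumeration; rows are {doc, start, end}, sorted by doc. */
    static final class ArraySpans {
        private final int[][] rows;
        private int pos = -1;

        ArraySpans(int[][] rows) {
            this.rows = rows;
        }

        boolean next() {
            return ++pos < rows.length;
        }

        /** Moves at least one span forward, stopping at the first span with doc >= target. */
        boolean skipTo(int target) {
            do {
                if (!next()) return false;
            } while (rows[pos][0] < target);
            return true;
        }

        int doc() { return rows[pos][0]; }
        int start() { return rows[pos][1]; }
        int end() { return rows[pos][2]; }
    }

    /** Returns doc id -> list of {start, end}, with keys kept in the page's original order. */
    static Map<Integer, List<int[]>> collect(int[] pageDocs, ArraySpans spans) {
        // Insertion order preserves the score order, so no "sort back" step is needed.
        Map<Integer, List<int[]>> byDoc = new LinkedHashMap<>();
        for (int doc : pageDocs) byDoc.put(doc, new ArrayList<>());

        // Visit the page in doc-id order so the spans only ever move forward.
        int[] sorted = pageDocs.clone();
        Arrays.sort(sorted);

        boolean more = spans.next();
        for (int doc : sorted) {
            if (!more) break;
            if (spans.doc() < doc) more = spans.skipTo(doc);
            while (more && spans.doc() == doc) {
                byDoc.get(doc).add(new int[] {spans.start(), spans.end()});
                more = spans.next();
            }
        }
        return byDoc;
    }
}
```

Keeping the results in an insertion-ordered map is one way to avoid the second sort; returning the docs in doc-id order, as the message also suggests, would work equally well.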
> > Yes, this is what I do, but it involves jumping through hoops, etc.
> > when it seems like during scoring we already had the info. Again, I
> > am likely willing to trade off the memory and some extra garbage
> > (but not much, I suspect) for having to go through the Spans again.
> > You can also somewhat optimize the iterate-over-ScoreDocs case by
> > asking whether the Spans.doc() is greater than the ScoreDoc.doc. If
> > it is, then you reset the Spans back to the beginning and do a
> > skipTo. Not sure if this is faster than the sorting approach.
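
The reset variant described above can be sketched as follows, again without Lucene itself. `ArraySpans` is a hypothetical in-memory stand-in for `Spans`, and the `Supplier` is an assumption standing in for whatever produces a fresh enumeration (the message does not say how the reset is done). The sketch also starts over when the enumeration is exhausted, which the message does not mention but which the loop needs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch: walk hits in score order, restarting the spans only when they have overshot. */
final class ResetOnBacktrack {

    /** In-memory stand-in for a Spans enumeration; rows are {doc, start, end}, sorted by doc. */
    static final class ArraySpans {
        private final int[][] rows;
        private int pos = -1;

        ArraySpans(int[][] rows) {
            this.rows = rows;
        }

        boolean next() {
            return ++pos < rows.length;
        }

        /** Moves at least one span forward, stopping at the first span with doc >= target. */
        boolean skipTo(int target) {
            do {
                if (!next()) return false;
            } while (rows[pos][0] < target);
            return true;
        }

        int doc() { return rows[pos][0]; }
        int start() { return rows[pos][1]; }
        int end() { return rows[pos][2]; }
    }

    /** Returns doc id -> list of {start, end}, keyed in the order the hits were given. */
    static Map<Integer, List<int[]>> collect(int[] docsInScoreOrder, Supplier<ArraySpans> freshSpans) {
        Map<Integer, List<int[]>> byDoc = new LinkedHashMap<>();
        ArraySpans spans = freshSpans.get();
        boolean more = spans.next();

        for (int doc : docsInScoreOrder) {
            List<int[]> found = new ArrayList<>();
            byDoc.put(doc, found);

            // Already past this doc (or exhausted): go back to the beginning.
            if (!more || spans.doc() > doc) {
                spans = freshSpans.get();
                more = spans.next();
            }
            // Still before this doc: skip forward to it.
            if (more && spans.doc() < doc) more = spans.skipTo(doc);

            while (more && spans.doc() == doc) {
                found.add(new int[] {spans.start(), spans.end()});
                more = spans.next();
            }
        }
        return byDoc;
    }
}
```

Counting how often the supplier is invoked for a typical page would be one way to compare this against the sorting approach, since each restart repeats work that sorting avoids.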
> > -Grant

Grant Ingersoll
