lucene-java-user mailing list archives

From Ravikumar Govindarajan <ravikumar.govindara...@gmail.com>
Subject Re: EarlyTerminatingSortingCollector help needed..
Date Sun, 22 Jun 2014 16:44:09 GMT
Thanks for your reply & clarifications

> What do you mean by "When I use a SortField instead"? Unless you are
> using early termination, Collector.collect is supposed to be called
> for every matching document.



For a normal sorting query on a top-level searcher, I execute

TopDocs docs = searcher.search(query, 50, new Sort(sortField));

Then I can call reader.document() for the final list of exactly 50 docs. That
gives me a global order across segments, but at the obvious cost of memory...

SortingMergePolicy + ETSC will make me do up to 50*N collects [N = number of
segments], which could increase the cost of seeks when each segment collects
a considerable number of hits...
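
To make the 50*N figure concrete, here is a minimal sketch in plain Java (no Lucene classes). It assumes every segment is already sorted, so collection in a segment stops once numDocsToCollect hits have been seen. The class and method names are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

final class CollectCalls {
    /**
     * Upper bound on collect() calls with early termination, assuming every
     * segment is sorted: each segment contributes at most numDocsToCollect
     * calls, and fewer if it has fewer matches.
     */
    static long collectCalls(int numDocsToCollect, int[] matchesPerSegment) {
        long total = 0;
        for (int matches : matchesPerSegment) {
            total += Math.min(matches, numDocsToCollect);
        }
        return total;
    }

    public static void main(String[] args) {
        // 15 segments, each with far more than 50 matches: the worst case.
        int[] matches = new int[15];
        Arrays.fill(matches, 1000);
        System.out.println(collectCalls(50, matches)); // prints 750
    }
}
```

Segments with fewer than 50 matches pull the total below 50*N, so 750 is a ceiling rather than the expected count.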

>  - you can afford the merging overhead (ie. for heavy indexing
> workloads, this might not be the best solution)
>  - there is a single sort order that is used for most queries
>  - you don't need any feature that requires to collect all documents
> (like computing the total hit count or facets).


Our use case fits all three of these points perfectly, and that's why we
wanted to explore this. But our final set of results must also be globally
ordered. Maybe it's a mistake to assume that sorting can be entirely
replaced with SMP + ETSC...
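
One way to picture the global-ordering requirement: if each segment yields its own hits already in sort order, a global top N can be obtained by merging those per-segment lists. The sketch below is plain Java under that assumption, not Lucene's implementation. GlobalTopN, merge, and the use of long sort keys are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class GlobalTopN {
    /**
     * Merges per-segment hit lists, each already sorted ascending by sort key,
     * into the first n keys in global order (a k-way merge).
     */
    static List<Long> merge(List<List<Long>> perSegment, int n) {
        // Each queue entry is {segmentIndex, positionInSegment}; entries are
        // ordered by the sort key they currently point at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> Long.compare(perSegment.get(a[0]).get(a[1]),
                                   perSegment.get(b[0]).get(b[1])));
        for (int s = 0; s < perSegment.size(); s++) {
            if (!perSegment.get(s).isEmpty()) {
                heads.add(new int[] {s, 0});
            }
        }
        List<Long> result = new ArrayList<>();
        while (result.size() < n && !heads.isEmpty()) {
            int[] head = heads.poll();
            List<Long> segment = perSegment.get(head[0]);
            result.add(segment.get(head[1]));
            // Advance within the segment the smallest key came from.
            if (head[1] + 1 < segment.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return result;
    }
}
```

With at most 50 hits kept per segment, the merge touches at most 50*N keys and returns only the first 50 of them.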

> I would not advise to use the stored fields API, even in the context
> of early termination. Doc values should be more efficient here?


I read your excellent blog post on stored-fields compression, where you
mention that stored fields now take only one random seek
[http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1].

If so, then what could make DocValues still a winner?

--
Ravi


On Sat, Jun 21, 2014 at 6:41 PM, Adrien Grand <jpountz@gmail.com> wrote:

> Hi Ravikumar,
>
> On Fri, Jun 20, 2014 at 12:14 PM, Ravikumar Govindarajan
> <ravikumar.govindarajan@gmail.com> wrote:
> > If my "numDocsToCollect" = 50 and no.of. segments = 15, then
> > collector.collect() will be called 750 times.
>
> That is the worst-case indeed. However if some of your segments have
> less than 50 matches, `collect` will only be called on those matches.
>
> > When I use a SortField instead, then TopFieldDocs does the sorting for all
> > segments and collector.collect() will be called only 50 times...
>
> What do you mean by "When I use a SortField instead"? Unless you are
> using early termination, Collector.collect is supposed to be called
> for every matching document.
>
> > Assuming a stored-field seek for every collector.collect(), will it be
> > advisable to still persist with ETSC? Was it introduced as a trade-off b/n
> > memory & disk?
>
> I would not advise to use the stored fields API, even in the context
> of early termination. Doc values should be more efficient here?
>
> The trade-off is not really about memory and disk. What it tries to
> achieve is to make queries much faster provided that:
>  - you can afford the merging overhead (ie. for heavy indexing
> workloads, this might not be the best solution)
>  - there is a single sort order that is used for most queries
>  - you don't need any feature that requires to collect all documents
> (like computing the total hit count or facets).
>
> --
> Adrien
