Date: Sun, 22 Jun 2014 22:14:09 +0530
Subject: Re: EarlyTerminatingSortingCollector help needed..
From: Ravikumar Govindarajan
To: "java-user@lucene.apache.org"

Thanks for your reply & clarifications.

> What do you mean by "When I use a SortField instead"? Unless you are
> using early termination, Collector.collect is supposed to be called
> for every matching document

For a normal sorting query on a top-level searcher, I execute

TopDocs docs = searcher.search(query, 50, sortField)

Then I can issue reader.document() for the final list of exactly 50 docs, which gives me a global order across segments, but at the obvious cost of memory...

SortingMergePolicy + ETSC will make me do 50*N [N = no. of segments] collects, which could increase the cost of seeks when each segment collects considerable hits...

> - you can afford the merging overhead (ie. for heavy indexing
> workloads, this might not be the best solution)
> - there is a single sort order that is used for most queries
> - you don't need any feature that requires to collect all documents
> (like computing the total hit count or facets).

Our use case fits perfectly on all 3 of these points, and that's why we wanted to explore this. But our final set of results must also be globally ordered. Maybe it's a mistake to assume that sorting can be entirely replaced with SMP + ETSC...

> I would not advise to use the stored fields API, even in the context
> of early termination. Doc values should be more efficient here?
I read your excellent blog on stored-fields compression, where you've mentioned that stored-fields now take only one random seek.

[ http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1 ]

If so, then what could make DocValues still a winner?

--
Ravi

On Sat, Jun 21, 2014 at 6:41 PM, Adrien Grand wrote:

> Hi Ravikumar,
>
> On Fri, Jun 20, 2014 at 12:14 PM, Ravikumar Govindarajan wrote:
> > If my "numDocsToCollect" = 50 and no.of. segments = 15, then
> > collector.collect() will be called 750 times.
>
> That is the worst-case indeed. However if some of your segments have
> less than 50 matches, `collect` will only be called on those matches.
>
> > When I use a SortField instead, then TopFieldDocs does the sorting for all
> > segments and collector.collect() will be called only 50 times...
>
> What do you mean by "When I use a SortField instead"? Unless you are
> using early termination, Collector.collect is supposed to be called
> for every matching document.
>
> > Assuming a stored-field seek for every collector.collect(), will it be
> > advisable to still persist with ETSC? Was it introduced as a trade-off b/n
> > memory & disk?
>
> I would not advise to use the stored fields API, even in the context
> of early termination. Doc values should be more efficient here?
>
> The trade-off is not really about memory and disk. What it tries to
> achieve is to make queries much faster provided that:
> - you can afford the merging overhead (ie. for heavy indexing
> workloads, this might not be the best solution)
> - there is a single sort order that is used for most queries
> - you don't need any feature that requires to collect all documents
> (like computing the total hit count or facets).
>
> --
> Adrien
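
The collect-count arithmetic discussed in this thread (at most numDocsToCollect calls per segment, fewer when a segment has fewer matches) can be illustrated with a small plain-Java simulation. This is only a sketch and does not use the Lucene API: the class EarlyTerminationSketch, its Result type, and the search method are hypothetical names, and each "segment" is assumed to be an array of sort keys already in index sort order, as a sorting merge policy would leave it. Every collected key goes into one shared bounded queue, so the final hits come out globally ordered even though collection stops early in each segment.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical simulation, not Lucene code: segments are arrays of sort keys,
// each already sorted ascending (the assumed index sort order).
class EarlyTerminationSketch {

    /** Global top hits plus the number of simulated collect() calls. */
    static final class Result {
        final long[] topHits;   // sort keys of the global top N, ascending
        final int collectCalls; // collect() calls summed over all segments

        Result(long[] topHits, int collectCalls) {
            this.topHits = topHits;
            this.collectCalls = collectCalls;
        }
    }

    static Result search(List<long[]> segments, int numDocsToCollect) {
        // Max-heap bounded to N entries: the largest key is evicted first,
        // so the heap always holds the N smallest keys seen so far.
        PriorityQueue<Long> queue = new PriorityQueue<>(Comparator.reverseOrder());
        int collectCalls = 0;
        for (long[] segment : segments) {
            int collectedInSegment = 0;
            for (long key : segment) {
                if (collectedInSegment == numDocsToCollect) {
                    break; // early termination: later docs in this segment cannot compete
                }
                collectCalls++;
                collectedInSegment++;
                queue.offer(key);
                if (queue.size() > numDocsToCollect) {
                    queue.poll(); // drop the current worst hit
                }
            }
        }
        // Drain the max-heap from the back to get ascending order.
        long[] top = new long[queue.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = queue.poll();
        }
        return new Result(top, collectCalls);
    }

    public static void main(String[] args) {
        List<long[]> segments = Arrays.asList(
                new long[] {1, 4, 9, 12},
                new long[] {2, 3},
                new long[] {5, 6, 7, 8});
        Result r = search(segments, 3);
        // Prints: [1, 2, 3] after 8 collect calls
        System.out.println(Arrays.toString(r.topHits) + " after " + r.collectCalls + " collect calls");
    }
}
```

With three segments and numDocsToCollect = 3 the worst case would be 9 calls; the second segment has only two matches, so the simulation makes 8.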