From: Ravikumar Govindarajan
To: "java-user@lucene.apache.org"
Reply-To: java-user@lucene.apache.org
Date: Tue, 17 Jun 2014 18:11:20 +0530
Subject: Re: SortingMergePolicy for already sorted segments

> Therefore the DocMap is initialized only when the
> merge actually executes ... what is there more to postpone?

Agreed. However, what I am asking is whether an alternative to DocMap would be better. Please read on.

> And besides, if the segments are already sorted, you should return a null DocMap,
> like Lucene code does ...

What I am trying to say is that my individual segments are sorted. However, when a merge combines "N" individually sorted segments, there needs to be a global sort order for writing the new segment. Passing a null DocMap won't work here, no?

DocMap is one way of establishing the global order during a merge. Another way is to use something like a MergedIterator instead of DocMap, which doesn't need any memory.

I was trying to get a heads-up on these 2 approaches. Please do let me know if I have understood correctly.

--
Ravi

On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera wrote:

> > I am afraid the DocMap still maintains doc-id mappings till merge and I am
> > trying to avoid it...
>
> What do you mean 'till merge'? The method OneMerge.getMergeReaders() is
> called only when the merge is executed, not when the MergePolicy decided to
> merge those segments. Therefore the DocMap is initialized only when the
> merge actually executes ... what is there more to postpone?
>
> And besides, if the segments are already sorted, you should return a null
> DocMap, like Lucene code does ...
>
> If I miss your point, I'd appreciate it if you can point me to a code example,
> preferably in Lucene source, which demonstrates the problem.
>
> Shai
>
> On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan <ravikumar.govindarajan@gmail.com> wrote:
>
> > I am afraid the DocMap still maintains doc-id mappings till merge and I am
> > trying to avoid it...
> >
> > I think Lucene itself has a MergedIterator in the o.a.l.util package.
> >
> > A MergePolicy can wrap a simple MergedIterator for iterating docs across
> > different AtomicReaders in correct sort-order for a given field/term
> >
> > That should be fine, right?
> >
> > --
> > Ravi
> >
> > On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera wrote:
> >
> > > loadSortTerm is your method, right? In the current Sorter.sort
> > > implementation, I see this code:
> > >
> > >     boolean sorted = true;
> > >     for (int i = 1; i < maxDoc; ++i) {
> > >       if (comparator.compare(i-1, i) > 0) {
> > >         sorted = false;
> > >         break;
> > >       }
> > >     }
> > >     if (sorted) {
> > >       return null;
> > >     }
> > >
> > > Perhaps you can write similar code?
> > >
> > > Also note that the sorting interface has changed, I think in 4.8, and now
> > > you don't really need to implement a Sorter, but rather pass a SortField,
> > > if that works for you.
> > >
> > > Shai
> > >
> > > On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan <ravikumar.govindarajan@gmail.com> wrote:
> > >
> > > > Shai,
> > > >
> > > > This is the code snippet I use inside my class...
> > > >
> > > > public class MySorter extends Sorter {
> > > >
> > > >   @Override
> > > >   public DocMap sort(AtomicReader reader) throws IOException {
> > > >     // loadSortTerm is the poster's own helper (not shown in the thread)
> > > >     final Map<Integer, BytesRef> docVsId = loadSortTerm(reader);
> > > >     final Sorter.DocComparator comparator = new Sorter.DocComparator() {
> > > >       @Override
> > > >       public int compare(int docID1, int docID2) {
> > > >         BytesRef v1 = docVsId.get(docID1);
> > > >         BytesRef v2 = docVsId.get(docID2);
> > > >         return v1.compareTo(v2);
> > > >       }
> > > >     };
> > > >     return sort(reader.maxDoc(), comparator);
> > > >   }
> > > > }
> > > >
> > > > My problem is, the "AtomicReader" passed to the Sorter.sort method is
> > > > actually a SlowCompositeReaderWrapper, composed of a list of AtomicReaders,
> > > > each of which is already sorted.
> > > >
> > > > I find this "loadSortTerm(compositeReader)" to be a bit heavy, since it
> > > > tries to load all the doc-to-term mappings eagerly...
> > > >
> > > > Are there some alternatives for this?
> > > >
> > > > --
> > > > Ravi
> > > >
> > > > On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera wrote:
> > > >
> > > > > I'm not sure that I follow ... where do you see DocMap being loaded up
> > > > > front? Specifically, Sorter.sort may return null if the readers are
> > > > > already sorted ... I think we already optimized for the case where the
> > > > > readers are sorted.
> > > > >
> > > > > Shai
> > > > >
> > > > > On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan <ravikumar.govindarajan@gmail.com> wrote:
> > > > >
> > > > > > I am planning to use SortingMergePolicy where all the merge-participating
> > > > > > segments are already sorted... I understand that I need to define a DocMap
> > > > > > with old-new doc-id mappings.
> > > > > >
> > > > > > Is it possible to optimize the eager loading of DocMap and make it kind of
> > > > > > lazy load on-demand?
> > > > > >
> > > > > > Ex: Pass List to the caller and ask for next new-old doc
> > > > > > mapping..
> > > > > >
> > > > > > Since my segments are already sorted, I could save on memory a little-bit
> > > > > > this way, instead of loading the full DocMap upfront
> > > > > >
> > > > > > --
> > > > > > Ravi
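
For reference, the "merged iterator" alternative raised in this thread can be sketched as a plain k-way merge. The sketch below is Lucene-free and only illustrative: SortedSegmentMerge, DocRef, Cursor and merge are hypothetical names, each "segment" is modelled as an array of sort keys that is assumed to be already in order, and nothing here is claimed to plug into SortingMergePolicy as it stands. It shows that the global order across N sorted segments can be produced lazily while holding one cursor per segment, rather than a full old-to-new doc-id map.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

/** Hypothetical illustration of a lazy k-way merge; not part of Lucene. */
final class SortedSegmentMerge {

    /** Identifies one document: its segment, its local doc ID, and its sort key. */
    static final class DocRef {
        final int segment;
        final int docId;
        final String sortKey;

        DocRef(int segment, int docId, String sortKey) {
            this.segment = segment;
            this.docId = docId;
            this.sortKey = sortKey;
        }
    }

    /** Read position inside one segment whose keys are assumed to be sorted. */
    private static final class Cursor {
        final int segment;
        final String[] keys;
        int docId = 0;

        Cursor(int segment, String[] keys) {
            this.segment = segment;
            this.keys = keys;
        }
    }

    /**
     * Lazily yields the documents of all segments in global sort order.
     * The heap holds at most one cursor per segment, never one entry per document.
     */
    static Iterator<DocRef> merge(List<String[]> segments) {
        final PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> {
            int c = a.keys[a.docId].compareTo(b.keys[b.docId]);
            // Ties go to the lower segment index so the merge stays stable.
            return c != 0 ? c : Integer.compare(a.segment, b.segment);
        });
        for (int i = 0; i < segments.size(); i++) {
            if (segments.get(i).length > 0) {
                heap.add(new Cursor(i, segments.get(i)));
            }
        }
        return new Iterator<DocRef>() {
            @Override
            public boolean hasNext() {
                return !heap.isEmpty();
            }

            @Override
            public DocRef next() {
                Cursor top = heap.poll();
                if (top == null) {
                    throw new NoSuchElementException();
                }
                DocRef ref = new DocRef(top.segment, top.docId, top.keys[top.docId]);
                // Advance the cursor and re-insert it only if the segment has more docs.
                if (++top.docId < top.keys.length) {
                    heap.add(top);
                }
                return ref;
            }
        };
    }

    public static void main(String[] args) {
        List<String[]> segments = Arrays.asList(
            new String[] {"apple", "melon"},
            new String[] {"banana", "cherry", "zebra"});
        Iterator<DocRef> it = merge(segments);
        while (it.hasNext()) {
            DocRef d = it.next();
            System.out.println(d.sortKey + " <- segment " + d.segment + ", doc " + d.docId);
        }
    }
}
```

The trade-off is that this gives sequential access only: a consumer that needs to look up the new position of an arbitrary old doc ID, which is what a DocMap provides, would still have to materialize the mapping.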