Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of
 ravikumar.govindarajan@gmail.com designates 74.125.82.51 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CALfq-2SK_2ejk8CCOCrm-uU6N_qfGgX0P0DTJNYLOF5jGMftxA@mail.gmail.com>
References: 
 <CAGW2whQC5OvMHyCq8sPSRf144XQATJYLsHvzQY4w056fJXZt1w@mail.gmail.com>
	<CALfq-2T0QkZKx+U2pcQcgmXActfwzqVZaWgAVkBz-yKzM_8bkQ@mail.gmail.com>
	<CAGW2whQLf_2PuGu502DfLKapxN7GbQ=gm5YSV-pkDYrL7QiyUg@mail.gmail.com>
	<CALfq-2TiP2X+QofoBpAv+x=xxb09MDW+zx5kz6_=sOQ2d_-YEQ@mail.gmail.com>
	<CAGW2whRpvO=Lt2P4VDd4A6UoOX_EQH1bs51Zge11Tugmb9ELjw@mail.gmail.com>
	<CALfq-2SK_2ejk8CCOCrm-uU6N_qfGgX0P0DTJNYLOF5jGMftxA@mail.gmail.com>
Date: Tue, 17 Jun 2014 18:11:20 +0530
Message-ID: 
 <CAGW2whTt+UVU-TT3Cd=LPHUPJsOix+GKj0zZUugbYaTMvApYKA@mail.gmail.com>
Subject: Re: SortingMergePolicy for already sorted segments
From: Ravikumar Govindarajan <ravikumar.govindarajan@gmail.com>
To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
Content-Type: multipart/alternative; boundary=089e013c62d4f4ae8704fc07755d

--089e013c62d4f4ae8704fc07755d
Content-Type: text/plain; charset=UTF-8

>
> Therefore the DocMap is initialized only when the
> merge actually executes ... what is there more to postpone?


Agreed. However, what I am asking is whether an alternative to DocMap
would be better. Please read on.

> And besides, if the segments are already sorted, you should return a null
> DocMap, like Lucene code does ...


What I am trying to say is that each of my individual segments is sorted.
However, when a merge combines N individually sorted segments, a global
sort order is still needed to write the new segment. Passing a null DocMap
won't work here, will it?
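
To make the concern concrete, here is a minimal sketch in plain Java. It uses
no Lucene classes; the class name, method names, and sample keys are made up
for illustration. It applies the same adjacent-pair check as the Sorter.sort
snippet quoted further down, first to each segment and then to the segments
laid end to end, which is assumed here to be how a composite view numbers
the docs:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ConcatenatedOrderCheck {

    // Adjacent-pair check, modeled on the Sorter.sort snippet quoted in this thread.
    static boolean isSorted(List<String> keys) {
        for (int i = 1; i < keys.size(); ++i) {
            if (keys.get(i - 1).compareTo(keys.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical stand-in for a composite view: segment after segment, in order.
    static List<String> concatenate(List<List<String>> segments) {
        List<String> all = new ArrayList<>();
        for (List<String> segment : segments) {
            all.addAll(segment);
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> seg0 = Arrays.asList("a", "c", "e");
        List<String> seg1 = Arrays.asList("b", "d");
        System.out.println(isSorted(seg0));                                    // true
        System.out.println(isSorted(seg1));                                    // true
        System.out.println(isSorted(concatenate(Arrays.asList(seg0, seg1)))); // false
    }
}
```

Each segment passes the check on its own, but the concatenation does not, so
in this toy case the "already sorted, return null" shortcut would not apply
to the merged view.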

A DocMap is one way of establishing the global order during a merge. Another
way is to use something like a MergedIterator<SegmentReader> instead of a
DocMap, which would not need to hold the doc-id mappings in memory.
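
The iterator idea can be sketched as a k-way merge over the per-segment sort
keys. This is only an illustration in plain Java, not Lucene's MergedIterator
and not anything SortingMergePolicy accepts today; the class
SortedSegmentMerger, its Doc type, and the use of String lists as stand-ins
for sorted segments are all assumptions made for the example:

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Hypothetical helper: walks docs of several sorted segments in global key order.
final class SortedSegmentMerger implements Iterator<SortedSegmentMerger.Doc> {

    // A doc identified by its segment index and its doc id within that segment.
    static final class Doc {
        final int segment;
        final int docId;
        final String key;

        Doc(int segment, int docId, String key) {
            this.segment = segment;
            this.docId = docId;
            this.key = key;
        }

        @Override
        public String toString() {
            return segment + ":" + docId + "=" + key;
        }
    }

    private final List<List<String>> segments;
    // Holds at most one "head" doc per segment, ordered by key, then by segment.
    private final PriorityQueue<Doc> heads;

    SortedSegmentMerger(List<List<String>> segments) {
        this.segments = segments;
        this.heads = new PriorityQueue<>((a, b) -> {
            int c = a.key.compareTo(b.key);
            return c != 0 ? c : Integer.compare(a.segment, b.segment);
        });
        for (int s = 0; s < segments.size(); s++) {
            if (!segments.get(s).isEmpty()) {
                heads.add(new Doc(s, 0, segments.get(s).get(0)));
            }
        }
    }

    @Override
    public boolean hasNext() {
        return !heads.isEmpty();
    }

    @Override
    public Doc next() {
        Doc top = heads.poll();
        if (top == null) {
            throw new NoSuchElementException();
        }
        // Refill the queue with the next doc from the segment we just consumed.
        List<String> segment = segments.get(top.segment);
        int nextId = top.docId + 1;
        if (nextId < segment.size()) {
            heads.add(new Doc(top.segment, nextId, segment.get(nextId)));
        }
        return top;
    }
}
```

For the segments ["a", "c", "e"] and ["b", "d"], iterating yields the docs in
the order a, b, c, d, e, and the queue never holds more than one entry per
segment. The sketch keeps the keys in in-memory lists for simplicity; reading
them from each reader on demand is assumed, not shown. Whether the merge code
could consume such an iterator in place of a DocMap is the open question
here.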

I was hoping to get feedback on these two approaches. Please let me know
if I have understood correctly.

--
Ravi


On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera <serera@gmail.com> wrote:

> >
> > I am afraid the DocMap still maintains doc-id mappings till merge and I am
> > trying to avoid it...
> >
>
> What do you mean 'till merge'? The method OneMerge.getMergeReaders() is
> called only when the merge is executed, not when the MergePolicy decided to
> merge those segments. Therefore the DocMap is initialized only when the
> merge actually executes ... what is there more to postpone?
>
> And besides, if the segments are already sorted, you should return a null
> DocMap, like Lucene code does ...
>
> If I am missing your point, I'd appreciate it if you could point me to a
> code example, preferably in Lucene source, which demonstrates the problem.
>
> Shai
>
>
> On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan <
> ravikumar.govindarajan@gmail.com> wrote:
>
> > I am afraid the DocMap still maintains doc-id mappings till merge and I am
> > trying to avoid it...
> >
> > I think Lucene itself has a MergedIterator in the o.a.l.util package.
> >
> > A MergePolicy can wrap a simple MergedIterator for iterating docs across
> > different AtomicReaders in the correct sort order for a given field/term.
> >
> > That should be fine, right?
> >
> > --
> > Ravi
> >
> >
> > On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera <serera@gmail.com> wrote:
> >
> > > loadSortTerm is your method, right? In the current Sorter.sort
> > > implementation, I see this code:
> > >
> > >     boolean sorted = true;
> > >     for (int i = 1; i < maxDoc; ++i) {
> > >       if (comparator.compare(i-1, i) > 0) {
> > >         sorted = false;
> > >         break;
> > >       }
> > >     }
> > >     if (sorted) {
> > >       return null;
> > >     }
> > >
> > > Perhaps you can write similar code?
> > >
> > > Also note that the sorting interface has changed, I think in 4.8, and
> > > now you don't really need to implement a Sorter, but rather pass a
> > > SortField, if that works for you.
> > >
> > > Shai
> > >
> > >
> > > On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan <
> > > ravikumar.govindarajan@gmail.com> wrote:
> > >
> > > > Shai,
> > > >
> > > > This is the code snippet I use inside my class...
> > > >
> > > > public class MySorter extends Sorter {
> > > >
> > > >   @Override
> > > >   public DocMap sort(AtomicReader reader) throws IOException {
> > > >     final Map<Integer, BytesRef> docVsId = loadSortTerm(reader);
> > > >
> > > >     final Sorter.DocComparator comparator = new Sorter.DocComparator() {
> > > >       @Override
> > > >       public int compare(int docID1, int docID2) {
> > > >         BytesRef v1 = docVsId.get(docID1);
> > > >         BytesRef v2 = docVsId.get(docID2);
> > > >         return v1.compareTo(v2);
> > > >       }
> > > >     };
> > > >
> > > >     return sort(reader.maxDoc(), comparator);
> > > >   }
> > > > }
> > > >
> > > > My problem is that the "AtomicReader" passed to the Sorter.sort method
> > > > is actually a SlowCompositeReaderWrapper, composed of a list of
> > > > AtomicReaders, each of which is already sorted.
> > > >
> > > > I find this "loadSortTerm(compositeReader)" to be a bit heavy, since
> > > > it tries to load all the doc-to-term mappings eagerly...
> > > >
> > > > Are there some alternatives for this?
> > > >
> > > > --
> > > > Ravi
> > > >
> > > >
> > > > On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera <serera@gmail.com> wrote:
> > > >
> > > > > I'm not sure that I follow ... where do you see DocMap being loaded
> > > > > up front? Specifically, Sorter.sort may return null if the readers
> > > > > are already sorted ... I think we already optimized for the case
> > > > > where the readers are sorted.
> > > > >
> > > > > Shai
> > > > >
> > > > >
> > > > > On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan <
> > > > > ravikumar.govindarajan@gmail.com> wrote:
> > > > >
> > > > > > I am planning to use SortingMergePolicy where all the
> > > > > > merge-participating segments are already sorted... I understand
> > > > > > that I need to define a DocMap with old-new doc-id mappings.
> > > > > >
> > > > > > Is it possible to optimize the eager loading of the DocMap and make
> > > > > > it load lazily, on demand?
> > > > > >
> > > > > > Ex: Pass a List<AtomicReader> to the caller and ask for the next
> > > > > > new-old doc mapping.
> > > > > >
> > > > > > Since my segments are already sorted, I could save a little memory
> > > > > > this way, instead of loading the full DocMap upfront.
> > > > > >
> > > > > > --
> > > > > > Ravi
> > > > > >
> > > > >
> > > >
> > >
> >
>

--089e013c62d4f4ae8704fc07755d--