From: Michael McCandless <lucene@mikemccandless.com>
Date: Wed, 12 Feb 2014 09:59:05 -0500
Subject: Re: Actual min and max-value of NumericField during codec flush
To: Lucene Users <java-user@lucene.apache.org>

Right, I think you'll need to use one of the LogXMergePolicy variants (or
subclass LogMergePolicy and make your own): they always pick adjacent
segments to merge. SortingMP lets you pass in the MP to wrap, so just pass
in a LogXMP, and then sort by timestamp?

Mike McCandless
http://blog.mikemccandless.com


On Wed, Feb 12, 2014 at 8:16 AM, Ravikumar Govindarajan wrote:
> Yes, exactly as you have described.
>
> Ex: Consider segments [S1, S2, S3 & S4] in reverse-chronological order,
> about to go through a merge.
>
> While SortingMergePolicy will correctly solve the merge part, it does not,
> however, play any role in picking the segments to merge, right?
>
> SMP internally delegates to TieredMergePolicy, which might pick S1 & S4 to
> merge, disturbing the global order. Ideally only "adjacent" segments
> should be picked up for a merge. Ex: {S1, S2} or {S2, S3, S4} etc...
>
> Can there be a better selection of segments to merge in this case, so as
> to maintain a semblance of global ordering?
>
> --
> Ravi
>
> On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> OK, I see (early termination).
>>
>> That's a challenge, because you really want the docs sorted backwards
>> from how they were added, right? And, e.g., merged and then searched
>> in "reverse segment order"?
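Mike's suggestion above (wrap a log merge policy in SortingMP and sort by timestamp) might be wired up roughly like this against the Lucene 4.x-era API. This is only a configuration sketch: SortingMergePolicy lived in the misc module (org.apache.lucene.index.sorter) at the time, and the "timestamp" field name plus the version/analyzer variables are assumptions, not from the thread:

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.sorter.SortingMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// LogByteSizeMergePolicy only ever merges adjacent segments;
// SortingMergePolicy wraps it so each merged segment comes out sorted
// by timestamp, newest first ("timestamp" is a hypothetical long field).
Sort byTimeDesc = new Sort(new SortField("timestamp", SortField.Type.LONG, true));
IndexWriterConfig iwc = new IndexWriterConfig(version, analyzer); // version/analyzer assumed in scope
iwc.setMergePolicy(new SortingMergePolicy(new LogByteSizeMergePolicy(), byTimeDesc));
```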
>>
>> I think you should be able to do this w/ SortingMergePolicy? And then
>> use a custom collector that stops after you've gone back far enough in
>> time for a given search.
>>
>> Mike McCandless
>> http://blog.mikemccandless.com
>>
>> On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan wrote:
>> > Mike,
>> >
>> > All our queries need to be sorted by the timestamp field, in
>> > descending order of time. [latest-first]
>> >
>> > Each segment is sorted in itself. But TieredMergePolicy picks
>> > arbitrary segments and merges them [even with SortingMergePolicy
>> > etc...]. I am trying to avoid this and see if an approximate global
>> > ordering of segments [by timestamp field] can be maintained via
>> > merges.
>> >
>> > Ex: TopN results will only examine the 2-3 most recent, smaller
>> > segments [best-case] and return, without examining older and bigger
>> > segments.
>> >
>> > I do not know the terminology; maybe "Early Query Termination Across
>> > Segments" etc...?
>> >
>> > --
>> > Ravi
>> >
>> > On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
>> > lucene@mikemccandless.com> wrote:
>> >
>> >> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the
>> >> total order.
>> >>
>> >> Only TieredMergePolicy merges out-of-order segments.
>> >>
>> >> I don't understand why you need to encourage merging of the more
>> >> recent (by your "time" field) segments...
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan wrote:
>> >> > Mike,
>> >> >
>> >> > Each of my flushed segments is fully ordered by time. But
>> >> > TieredMergePolicy or LogByteSizeMergePolicy is going to pick
>> >> > arbitrary time-segments and disturb this arrangement, and I wanted
>> >> > some kind of control over this.
>> >> >
>> >> > But like you pointed out, going by only time-adjacent merges can
>> >> > be disastrous.
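The "custom collector that stops early" idea above can be modeled in plain Java. This is a self-contained toy, not the Lucene Collector API: if segments are visited newest-first and each is internally sorted latest-first, collection can stop as soon as topN hits are gathered, never touching the older segments.

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyTermination {
    // segmentsNewestFirst: each inner list holds timestamps sorted
    // descending; the outer list is ordered newest segment first.
    // Returns as soon as topN hits are collected -- the analogue of a
    // collector aborting the search once it has gone back far enough.
    static List<Long> topNLatest(List<List<Long>> segmentsNewestFirst, int topN) {
        List<Long> hits = new ArrayList<>();
        for (List<Long> segment : segmentsNewestFirst) {
            for (long timestamp : segment) {
                hits.add(timestamp);
                if (hits.size() == topN) {
                    return hits; // early termination: older segments never read
                }
            }
        }
        return hits;
    }
}
```

Note that this only returns the globally newest hits if the segments really are in reverse-chronological order, which is exactly why the thread worries about the merge policy disturbing that order.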
>> >> >
>> >> > Is there a way to mix both time and size to arrive at a somewhat
>> >> > [less-than-accurate] global order of segment merges?
>> >> >
>> >> > Like attempt a time-adjacent merge, provided the sizes of the
>> >> > segments are not extremely skewed etc...
>> >> >
>> >> > --
>> >> > Ravi
>> >> >
>> >> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
>> >> > lucene@mikemccandless.com> wrote:
>> >> >
>> >> >> You want to focus merging on the segments containing newer
>> >> >> documents? Why? This seems somewhat dangerous...
>> >> >>
>> >> >> Not taking into account the "true" segment size can lead to very,
>> >> >> very poor merge decisions... you should turn on IndexWriter's
>> >> >> infoStream and do a long-running test to convince yourself the
>> >> >> merging is being sane.
>> >> >>
>> >> >> Mike
>> >> >>
>> >> >> Mike McCandless
>> >> >>
>> >> >> http://blog.mikemccandless.com
>> >> >>
>> >> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan wrote:
>> >> >> > Thanks Mike,
>> >> >> >
>> >> >> > Will try your suggestion. I will try to describe the actual
>> >> >> > use-case itself.
>> >> >> >
>> >> >> > There is a requirement for merging time-adjacent segments
>> >> >> > [append-only, rolling time-series data].
>> >> >> >
>> >> >> > All documents have a timestamp affixed, and during flush I need
>> >> >> > to note down the least timestamp across all documents, through
>> >> >> > a Codec.
>> >> >> >
>> >> >> > Then, I define a TimeMergePolicy extends LogMergePolicy and
>> >> >> > define segment-size = Long.MAX_VALUE - SEG_LEAST_TIME
>> >> >> > [segment-diag].
>> >> >> >
>> >> >> > LogMergePolicy will auto-arrange levels of segments according
>> >> >> > to time and proceed with merges.
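The size trick described above can be checked in isolation (a sketch of the idea only, not the actual LogMergePolicy subclass): reporting Long.MAX_VALUE minus the segment's least timestamp as its "size" makes newer segments look smaller, so a policy that arranges segments into levels by size ends up arranging them by recency.

```java
public class TimeAsSize {
    // Pseudo-size from the thread: a newer segment (larger least
    // timestamp) reports a smaller size, so size-ordered merge
    // selection becomes time-ordered selection.
    static long pseudoSize(long leastTimestamp) {
        return Long.MAX_VALUE - leastTimestamp;
    }
}
```

Mike's warning still applies: this throws away the true byte size entirely, so a huge old segment and a tiny old segment look alike to the policy, which can produce badly skewed merges.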
>> >> >> > Latest segments will be lesser in size and preferred during
>> >> >> > merges over older and bigger segments.
>> >> >> >
>> >> >> > Do you think such an approach will be fine, or are there better
>> >> >> > ways to solve this?
>> >> >> >
>> >> >> > --
>> >> >> > Ravi
>> >> >> >
>> >> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
>> >> >> > lucene@mikemccandless.com> wrote:
>> >> >> >
>> >> >> >> Somewhere in those numeric trie terms are the exact integers
>> >> >> >> from your documents, encoded.
>> >> >> >>
>> >> >> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the
>> >> >> >> int value back from the BytesRef term.
>> >> >> >>
>> >> >> >> But you need to filter out the "higher level" terms, e.g. using
>> >> >> >> NumericUtils.getPrefixCodedLongShift(term) == 0. Or use
>> >> >> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum. I
>> >> >> >> believe all the terms you want come first, so once you hit a
>> >> >> >> term where .getPrefixCodedLongShift is > 0, that's your max
>> >> >> >> term and you can stop checking.
>> >> >> >>
>> >> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so
>> >> >> >> that you can e.g. pull your own TermsEnum and iterate the
>> >> >> >> terms yourself.
>> >> >> >>
>> >> >> >> Mike McCandless
>> >> >> >>
>> >> >> >> http://blog.mikemccandless.com
>> >> >> >>
>> >> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan wrote:
>> >> >> >> > I use a Codec to flush data. All methods delegate to the
>> >> >> >> > actual Lucene42Codec, except for intercepting one single
>> >> >> >> > field. This field is indexed as an IntField [numeric trie],
>> >> >> >> > with precisionStep=4.
>> >> >> >> >
>> >> >> >> > The purpose of the Codec is as follows:
>> >> >> >> >
>> >> >> >> > 1. Note the first BytesRef for this field
>> >> >> >> > 2. During the finish() call [TermsConsumer.java], note the
>> >> >> >> >    last BytesRef for this field
>> >> >> >> > 3. Convert both the first/last BytesRef to their respective
>> >> >> >> >    integers
>> >> >> >> > 4. Store these 2 ints in the segment-info diagnostics
>> >> >> >> >
>> >> >> >> > The problem with this approach is that the first/last
>> >> >> >> > BytesRef is totally different from the actual "int" values I
>> >> >> >> > try to index. I guess this is because the numeric trie
>> >> >> >> > explodes all the integers into its own format of BytesRefs.
>> >> >> >> > Hence my Codec stores the wrong values in the segment
>> >> >> >> > diagnostics.
>> >> >> >> >
>> >> >> >> > Is there a way I can record the actual min/max int-values
>> >> >> >> > correctly in my codec and still support NumericRange search?
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Ravi

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
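On the NumericUtils question in the thread: the full-precision trie terms carry shift == 0 and sort before the coarser shift > 0 "higher level" terms, so the min is the first full-precision term and the max is the last one seen before the shift turns positive. A self-contained model of that filtering follows; the (shift, value) pairs stand in for decoded trie terms in term-dictionary order, and this is an illustration of the logic, not the NumericUtils API itself:

```java
import java.util.List;

public class TrieMinMax {
    // termsInOrder models trie terms in term-dictionary order as
    // {shift, value} pairs: shift == 0 terms are full precision and come
    // first; shift > 0 terms are the coarser "higher level" terms.
    static int[] minMax(List<int[]> termsInOrder) {
        Integer min = null, max = null;
        for (int[] term : termsInOrder) {
            if (term[0] > 0) {
                break; // first higher-level term: full-precision terms are done
            }
            if (min == null) {
                min = term[1]; // first shift-0 term is the minimum
            }
            max = term[1];     // last shift-0 term seen so far is the maximum
        }
        return new int[] { min, max };
    }
}
```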