lucene-java-user mailing list archives

From Ravikumar Govindarajan <ravikumar.govindara...@gmail.com>
Subject Re: Actual min and max-value of NumericField during codec flush
Date Thu, 13 Feb 2014 05:25:00 GMT
@Mike,

I had suggested the same approach in one of my previous mails, whereby
each segment records its min/max timestamps in seg-info diagnostics and
uses them for merging adjacent segments.

"Then, I define a TimeMergePolicy extends LogMergePolicy and define the
segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag]. "

But you have expressed reservations:

"This seems somewhat dangerous...

Not taking into account the "true" segment size can lead to very very
poor merge decisions ... you should turn on IndexWriter's infoStream
and do a long running test to convince yourself the merging is being
sane."

Will merging be disastrous, if I choose a TimeMergePolicy? I will also test
and verify, but it's always great to hear finer points from experts.
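
To make the concern concrete, here is a small arithmetic sketch in plain Java (no Lucene types). It assumes that a log-based policy buckets segments by roughly log(size) / log(mergeFactor); that formula, the class name, and the sample epoch-second timestamps are my assumptions for illustration, not code taken from Lucene. It compares the "level" of two pseudo-sizes computed as Long.MAX_VALUE - SEG_LEAST_TIME against the levels of two real byte sizes.

```java
// Sketch only: hypothetical names, plain Java, no Lucene dependency.
final class LevelCollapseSketch {

    // Assumed shape of a log-based level: log(size) / log(mergeFactor).
    static double level(long size, int mergeFactor) {
        return Math.log((double) size) / Math.log(mergeFactor);
    }

    // The pseudo-size proposed in this thread.
    static long pseudoSize(long leastTimeSeconds) {
        return Long.MAX_VALUE - leastTimeSeconds;
    }

    public static void main(String[] args) {
        int mergeFactor = 10;

        // Two hypothetical segments whose least timestamps are about a year apart.
        long older = 1_360_000_000L;
        long newer = 1_392_000_000L;
        double gapByTime = Math.abs(level(pseudoSize(older), mergeFactor)
                                  - level(pseudoSize(newer), mergeFactor));

        // Two real sizes: 1 MB and 1 GB.
        double gapBySize = Math.abs(level(1L << 30, mergeFactor)
                                  - level(1L << 20, mergeFactor));

        System.out.println("level gap using pseudo-sizes: " + gapByTime);
        System.out.println("level gap using byte sizes:   " + gapBySize);
    }
}
```

Because Long.MAX_VALUE dwarfs any realistic timestamp, the pseudo-sizes differ only far to the right of the decimal point on a log scale. If the assumption about level computation holds, every segment would land in the same level and the real byte sizes would play no part in the decision, which seems to be the risk Mike describes.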

@Shai,

LogByteSizeMP categorizes "adjacency" by "size", whereas it would be better
if "timestamp" were used in my case.

Sure, I need to wrap this in an SMP to make sure that the newly-created
segment is also in sorted-order
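
A minimal sketch of what I mean by time-based adjacency with a size guard, again in plain Java with no Lucene types. Everything here is hypothetical: the Seg class, the pick method, and the maxSkew threshold are illustrative names, and a real MergePolicy would work on segment infos instead. Segments are assumed to be listed newest first by their recorded least timestamp; the method returns the first run of mergeFactor time-adjacent segments whose largest-to-smallest size ratio stays within maxSkew.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: hypothetical names, plain Java, no Lucene dependency.
final class TimeAdjacentPickSketch {

    static final class Seg {
        final String name;
        final long leastTime;   // least timestamp recorded for the segment
        final long sizeBytes;   // true on-disk size

        Seg(String name, long leastTime, long sizeBytes) {
            this.name = name;
            this.leastTime = leastTime;
            this.sizeBytes = sizeBytes;
        }
    }

    // byTimeDesc must already be ordered newest first.
    // Returns the names of the first window of mergeFactor adjacent segments
    // whose max/min size ratio is at most maxSkew, or an empty list if none.
    static List<String> pick(List<Seg> byTimeDesc, int mergeFactor, double maxSkew) {
        for (int start = 0; start + mergeFactor <= byTimeDesc.size(); start++) {
            long min = Long.MAX_VALUE;
            long max = 0;
            for (int i = start; i < start + mergeFactor; i++) {
                long size = byTimeDesc.get(i).sizeBytes;
                min = Math.min(min, size);
                max = Math.max(max, size);
            }
            if ((double) max / (double) min <= maxSkew) {
                List<String> names = new ArrayList<>();
                for (int i = start; i < start + mergeFactor; i++) {
                    names.add(byTimeDesc.get(i).name);
                }
                return names;
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        List<Seg> segs = Arrays.asList(
            new Seg("S1", 400, 1),
            new Seg("S2", 300, 500),
            new Seg("S3", 200, 2),
            new Seg("S4", 100, 3));
        // S1+S2 and S2+S3 are too skewed, so the window slides on to S3+S4.
        System.out.println(pick(segs, 2, 10.0));
    }
}
```

Only contiguous windows are ever considered, so a merge never combines S1 with S4 while skipping S2 and S3, and the skew check keeps one tiny segment from being repeatedly merged into a huge neighbour. Whether this behaves sanely over a long run is exactly what the infoStream test would have to show.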

--
Ravi



On Wed, Feb 12, 2014 at 8:29 PM, Shai Erera <serera@gmail.com> wrote:

> Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks adjacent
> segments and SortingMP ensures the merged segment is also sorted.
>
> Shai
>
>
> On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan <
> ravikumar.govindarajan@gmail.com> wrote:
>
> > Yes exactly as you have described.
> >
> > Ex: Consider segments [S1, S2, S3 & S4] that are in reverse-chronological
> > order and go for a merge
> >
> > While SortingMergePolicy will correctly solve the merge-part, it does not
> > however play any role in picking segments to merge right?
> >
> > SMP internally delegates to TieredMergePolicy, which might pick S1&S4 to
> > merge disturbing the global-order. Ideally only "adjacent" segments should
> > be picked up for merge. Ex: {S1,S2} or {S2,S3,S4} etc...
> >
> > Can there be a better selection of segments to merge in this case, so as to
> > maintain a semblance of global-ordering?
> >
> > --
> > Ravi
> >
> >
> >
> > On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> > > OK, I see (early termination).
> > >
> > > That's a challenge, because you really want the docs sorted backwards
> > > from how they were added right?  And, e.g., merged and then searched
> > > in "reverse segment order"?
> > >
> > > I think you should be able to do this w/ SortingMergePolicy?  And then
> > > use a custom collector that stops after you've gone back enough in
> > > time for a given search.
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
> > > <ravikumar.govindarajan@gmail.com> wrote:
> > > > Mike,
> > > >
> > > > All our queries need to be sorted by timestamp field, in descending order
> > > > of time. [latest-first]
> > > >
> > > > Each segment is sorted in itself. But TieredMergePolicy picks arbitrary
> > > > segments and merges them [even with SortingMergePolicy etc...]. I am trying
> > > > to avoid this and see if an approximate global ordering of segments [by
> > > > time-stamp field] can be maintained via merge.
> > > >
> > > > Ex: TopN results will only examine recent 2-3 smaller segments [best-case]
> > > > and return, without examining older and bigger segments.
> > > >
> > > > I do not know the terminology, maybe "Early Query Termination Across
> > > > Segments" etc...?
> > > >
> > > > --
> > > > Ravi
> > > >
> > > >
> > > > On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
> > > > lucene@mikemccandless.com> wrote:
> > > >
> > > >> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total
> > > >> order.
> > > >>
> > > >> Only TieredMergePolicy merges out-of-order segments.
> > > >>
> > > >> I don't understand why you need to encourage merging of the more
> > > >> recent (by your "time" field) segments...
> > > >>
> > > >> Mike McCandless
> > > >>
> > > >> http://blog.mikemccandless.com
> > > >>
> > > >>
> > > >> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
> > > >> <ravikumar.govindarajan@gmail.com> wrote:
> > > >> > Mike,
> > > >> >
> > > >> > Each of my flushed segments is fully ordered by time. But TieredMergePolicy
> > > >> > or LogByteSizeMergePolicy is going to pick arbitrary time-segments and
> > > >> > disturb this arrangement and I wanted some kind of control on this.
> > > >> >
> > > >> > But like you pointed out, going only by time-adjacent merges can be
> > > >> > disastrous.
> > > >> >
> > > >> > Is there a way to mix both time and size to arrive at a somewhat
> > > >> > [less-than-accurate] global order of segment merges?
> > > >> >
> > > >> > Like attempt a time-adjacent merge, provided size of segments is not
> > > >> > extremely skewed etc...
> > > >> >
> > > >> > --
> > > >> > Ravi
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> > > >> > lucene@mikemccandless.com> wrote:
> > > >> >
> > > >> >> You want to focus merging on the segments containing newer documents?
> > > >> >> Why?  This seems somewhat dangerous...
> > > >> >>
> > > >> >> Not taking into account the "true" segment size can lead to very very
> > > >> >> poor merge decisions ... you should turn on IndexWriter's infoStream
> > > >> >> and do a long running test to convince yourself the merging is being
> > > >> >> sane.
> > > >> >>
> > > >> >> Mike
> > > >> >>
> > > >> >> Mike McCandless
> > > >> >>
> > > >> >> http://blog.mikemccandless.com
> > > >> >>
> > > >> >>
> > > >> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> > > >> >> <ravikumar.govindarajan@gmail.com> wrote:
> > > >> >> > Thanks Mike,
> > > >> >> >
> > > >> >> > Will try your suggestion. I will try to describe the actual use-case
> > > >> >> > itself
> > > >> >> >
> > > >> >> > There is a requirement for merging time-adjacent segments [append-only,
> > > >> >> > rolling time-series data]
> > > >> >> >
> > > >> >> > All Documents have a timestamp affixed and during flush I need to note
> > > >> >> > down the least timestamp for all documents, through Codec.
> > > >> >> >
> > > >> >> > Then, I define a TimeMergePolicy extends LogMergePolicy and define the
> > > >> >> > segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
> > > >> >> >
> > > >> >> > LogMergePolicy will auto-arrange levels of segments according to time and
> > > >> >> > proceed with merges. Latest segments will be lesser in size and preferred
> > > >> >> > during merges than older and bigger segments
> > > >> >> >
> > > >> >> > Do you think such an approach will be fine or there are better ways to
> > > >> >> > solve this?
> > > >> >> >
> > > >> >> > --
> > > >> >> > Ravi
> > > >> >> >
> > > >> >> >
> > > >> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> > > >> >> > lucene@mikemccandless.com> wrote:
> > > >> >> >
> > > >> >> >> Somewhere in those numeric trie terms are the exact integers from your
> > > >> >> >> documents, encoded.
> > > >> >> >>
> > > >> >> >> You can use oal.util.NumericUtils.prefixCodecToInt to get the int
> > > >> >> >> value back from the BytesRef term.
> > > >> >> >>
> > > >> >> >> But you need to filter out the "higher level" terms, e.g. using
> > > >> >> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
> > > >> >> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I believe
> > > >> >> >> all the terms you want come first, so once you hit a term where
> > > >> >> >> .getPrefixCodedLongShift is > 0, that's your max term and you can stop
> > > >> >> >> checking.
> > > >> >> >>
> > > >> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
> > > >> >> >> you can e.g. pull your own TermsEnum and iterate the terms yourself.
> > > >> >> >>
> > > >> >> >> Mike McCandless
> > > >> >> >>
> > > >> >> >> http://blog.mikemccandless.com
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> > > >> >> >> <ravikumar.govindarajan@gmail.com> wrote:
> > > >> >> >> > I use a Codec to flush data. All methods delegate to actual Lucene42Codec,
> > > >> >> >> > except for intercepting one single-field. This field is indexed as an
> > > >> >> >> > IntField [Numeric-Trie...], with precisionStep=4.
> > > >> >> >> >
> > > >> >> >> > The purpose of the Codec is as follows
> > > >> >> >> >
> > > >> >> >> > 1. Note the first BytesRef for this field
> > > >> >> >> > 2. During finish() call [TermsConsumer.java], note the last BytesRef for
> > > >> >> >> > this field
> > > >> >> >> > 3. Convert both the first/last BytesRef to respective integers
> > > >> >> >> > 4. Store these 2 ints in segment-info diagnostics
> > > >> >> >> >
> > > >> >> >> > The problem with this approach is that the first/last BytesRef is totally
> > > >> >> >> > different from the actual "int" values I try to index. I guess this is
> > > >> >> >> > because Numeric-Trie explodes all the integers into its own format of
> > > >> >> >> > BytesRefs. Hence my Codec stores the wrong values in segment-diagnostics
> > > >> >> >> >
> > > >> >> >> > Is there a way I can record actual min/max int-values correctly in my
> > > >> >> >> > codec and still support NumericRange search?
> > > >> >> >> >
> > > >> >> >> > --
> > > >> >> >> > Ravi
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> ---------------------------------------------------------------------
> > > >> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >> >> >>
> > > >> >> >>
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >>
> > > >>
> > > >>
> > > >>
> > > >>
> > >
> > >
> > >
> >
>
