lucene-java-user mailing list archives

From Ravikumar Govindarajan <ravikumar.govindara...@gmail.com>
Subject Re: Actual min and max-value of NumericField during codec flush
Date Fri, 07 Feb 2014 04:24:20 GMT
Thanks Mike,

I will try your suggestion. Let me describe the actual use case.

The requirement is to merge time-adjacent segments [append-only,
rolling time-series data].

Every document carries a timestamp, and during flush I need to record,
through the Codec, the least timestamp among the documents in the segment.

I then define a TimeMergePolicy that extends LogMergePolicy and reports the
segment size as Long.MAX_VALUE - SEG_LEAST_TIME [read from the segment
diagnostics].

LogMergePolicy will then arrange the segments into levels according to time
and proceed with merges. The latest segments will be smaller in size and
will be preferred for merging over older, bigger segments.
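
To make the size trick concrete, here is a minimal sketch in plain Java, so it runs without Lucene. The names pseudoSize and level and the sample timestamps are hypothetical, and the level formula log(size) / log(mergeFactor) is only an assumption about how a log-based policy buckets segments, not code taken from LogMergePolicy.

```java
import java.util.concurrent.TimeUnit;

class TimeSizeSketch {
    /** Pseudo-size as described above: the newer the least timestamp, the smaller the value. */
    static long pseudoSize(long segLeastTimeMillis) {
        return Long.MAX_VALUE - segLeastTimeMillis;
    }

    /** Log-scale level; ASSUMPTION: levels are derived as log(size) / log(mergeFactor). */
    static double level(long size, int mergeFactor) {
        return Math.log((double) size) / Math.log(mergeFactor);
    }

    public static void main(String[] args) {
        long older = 1391660000000L;                     // hypothetical least timestamp (ms)
        long newer = older + TimeUnit.DAYS.toMillis(1);  // a segment flushed one day later

        long olderSize = pseudoSize(older);
        long newerSize = pseudoSize(newer);

        // The newer segment reports the smaller pseudo-size.
        System.out.println("newer is smaller: " + (newerSize < olderSize));

        // Print the log-scale levels so the gap between the two segments is visible.
        System.out.println("older level: " + level(olderSize, 10));
        System.out.println("newer level: " + level(newerSize, 10));
    }
}
```

The ordering comes out as intended, but with millisecond timestamps both pseudo-sizes sit very close to Long.MAX_VALUE, so their log-scale levels are nearly identical. It may be worth checking with real timestamps whether the levels separate segments by time as expected.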

Do you think such an approach will be fine, or are there better ways to
solve this?

--
Ravi


On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Somewhere in those numeric trie terms are the exact integers from your
> documents, encoded.
>
> You can use oal.util.NumericUtils.prefixCodedToInt to get the int
> value back from the BytesRef term.
>
> But you need to filter out the "higher level" terms, e.g. using
> NumericUtils.getPrefixCodedIntShift(term) == 0.  Or use
> NumericUtils.filterPrefixCodedInts to wrap a TermsEnum.  I believe
> all the terms you want come first, so once you hit a term where
> .getPrefixCodedIntShift is > 0, the term just before it is your max
> and you can stop checking.
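
The scan described above can be sketched without Lucene on the classpath. The code below is a simplified model of a trie encoding (a leading byte carrying the shift, followed by 7-bit groups of the sign-flipped value); it is an assumption for illustration, not Lucene's NumericUtils, and every name in it (TrieMinMaxSketch, encode, decode, shiftOf, index, minMax) is hypothetical. In a real codec the NumericUtils helpers named above would replace encode, decode and shiftOf.

```java
import java.util.Comparator;
import java.util.TreeSet;

class TrieMinMaxSketch {
    // Base value of the leading marker byte (ASSUMPTION: modelled on int terms).
    static final int SHIFT_START = 0x60;

    /** Simplified term: [marker + shift][7-bit groups of the sortable value, high bits first]. */
    static byte[] encode(int value, int shift) {
        int nGroups = (31 - shift) / 7 + 1;
        byte[] term = new byte[nGroups + 1];
        term[0] = (byte) (SHIFT_START + shift);
        int sortable = (value ^ 0x80000000) >>> shift; // flip sign bit so byte order matches int order
        for (int i = nGroups; i > 0; i--) {
            term[i] = (byte) (sortable & 0x7f);
            sortable >>>= 7;
        }
        return term;
    }

    static int shiftOf(byte[] term) {
        return term[0] - SHIFT_START;
    }

    /** Inverse of encode; for shift > 0 the low bits are lost and come back as zero. */
    static int decode(byte[] term) {
        int sortable = 0;
        for (int i = 1; i < term.length; i++) {
            sortable = (sortable << 7) | term[i];
        }
        return (sortable << shiftOf(term)) ^ 0x80000000;
    }

    // Unsigned lexicographic byte order, the order in which a terms dictionary lists terms.
    static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    };

    /** Explodes each value into one term per precision level, as a trie field would. */
    static TreeSet<byte[]> index(int[] values, int precisionStep) {
        TreeSet<byte[]> terms = new TreeSet<>(BYTE_ORDER);
        for (int v : values) {
            for (int shift = 0; shift < 32; shift += precisionStep) {
                terms.add(encode(v, shift));
            }
        }
        return terms;
    }

    /** Walks the sorted terms and stops at the first term whose shift is > 0. */
    static int[] minMax(Iterable<byte[]> sortedTerms) {
        Integer min = null;
        int max = 0;
        for (byte[] term : sortedTerms) {
            if (shiftOf(term) > 0) break;   // lower-precision terms start here
            int v = decode(term);
            if (min == null) min = v;       // first full-precision term is the minimum
            max = v;                        // last full-precision term seen so far
        }
        if (min == null) throw new IllegalStateException("no full-precision terms");
        return new int[] { min, max };
    }

    public static void main(String[] args) {
        int[] values = { 1391747060, 1391660000, 1391700000 }; // hypothetical timestamps (s)
        TreeSet<byte[]> terms = index(values, 4);
        int[] mm = minMax(terms);
        System.out.println("min=" + mm[0] + " max=" + mm[1]);
        // The very last term belongs to the coarsest level, so it does not decode to the max.
        System.out.println("last term decodes to " + decode(terms.last()));
    }
}
```

In this model the very last term in the dictionary is a coarse, high-shift term, which would explain why simply noting the last BytesRef gives a value unrelated to the real maximum.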
>
> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
> you can e.g. pull your own TermsEnum and iterate the terms yourself.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> <ravikumar.govindarajan@gmail.com> wrote:
> > I use a Codec to flush data. All methods delegate to the actual
> > Lucene42Codec, except that it intercepts one single field. This field
> > is indexed as an IntField [Numeric-Trie...], with precisionStep=4.
> >
> > The purpose of the Codec is as follows:
> >
> > 1. Note the first BytesRef for this field
> > 2. During the finish() call [TermsConsumer.java], note the last BytesRef
> > for this field
> > 3. Convert both the first and last BytesRef to their respective integers
> > 4. Store these 2 ints in the segment-info diagnostics
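
Step 4 can be sketched as below, assuming the diagnostics are a plain Map<String, String>, which is the type SegmentInfo.getDiagnostics() returns. The key names and the helper class are hypothetical, and how the map is attached to the segment is left out.

```java
import java.util.HashMap;
import java.util.Map;

class DiagnosticsSketch {
    // Hypothetical key names; any keys that do not clash with existing diagnostics would do.
    static final String MIN_KEY = "timestamp.min";
    static final String MAX_KEY = "timestamp.max";

    /** Stores the two ints as strings, since diagnostics values are strings. */
    static void record(Map<String, String> diagnostics, int min, int max) {
        diagnostics.put(MIN_KEY, Integer.toString(min));
        diagnostics.put(MAX_KEY, Integer.toString(max));
    }

    /** Reads the pair back as {min, max}. */
    static int[] read(Map<String, String> diagnostics) {
        return new int[] {
            Integer.parseInt(diagnostics.get(MIN_KEY)),
            Integer.parseInt(diagnostics.get(MAX_KEY))
        };
    }

    public static void main(String[] args) {
        Map<String, String> diagnostics = new HashMap<>();
        record(diagnostics, 1391660000, 1391747060); // hypothetical values
        int[] mm = read(diagnostics);
        System.out.println("min=" + mm[0] + " max=" + mm[1]);
    }
}
```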
> >
> > The problem with this approach is that the first/last BytesRef is
> > totally different from the actual "int" values I try to index. I guess
> > this is because Numeric-Trie explodes all the integers into its own
> > format of BytesRefs. Hence my Codec stores the wrong values in the
> > segment diagnostics.
> >
> > Is there a way I can record the actual min/max int values correctly in
> > my codec and still support NumericRange search?
> >
> > --
> > Ravi
>
