lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Actual min and max-value of NumericField during codec flush
Date Fri, 07 Feb 2014 10:47:32 GMT
You want to focus merging on the segments containing newer documents?
Why?  This seems somewhat dangerous...

Not taking into account the "true" segment size can lead to very
poor merge decisions ... you should turn on IndexWriter's infoStream
and run a long test to convince yourself that the merging stays
sane.
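
As a rough illustration of the concern, the sketch below computes the
time-based pseudo-size proposed in the quoted message (Long.MAX_VALUE
minus the segment's least timestamp) for two segments one day apart.
The names pseudoSize and level are hypothetical, and the level formula
log(size)/log(mergeFactor) is an assumption about how a log-based
policy buckets segments, not a copy of LogMergePolicy's code.

```java
class PseudoSizeSketch {
    // Hypothetical helper: the pseudo-size described in the quoted message.
    static long pseudoSize(long segLeastTimeMillis) {
        return Long.MAX_VALUE - segLeastTimeMillis;
    }

    // Assumed level formula for a log-based policy: log(size) / log(mergeFactor).
    static double level(long size, int mergeFactor) {
        return Math.log((double) size) / Math.log(mergeFactor);
    }

    public static void main(String[] args) {
        long older = 1391644800000L;       // 2014-02-06T00:00:00Z in epoch millis
        long newer = older + 86_400_000L;  // one day later

        long olderSize = pseudoSize(older);
        long newerSize = pseudoSize(newer);

        // The newer segment gets the smaller pseudo-size, as intended.
        System.out.println("newer is smaller: " + (newerSize < olderSize));

        // The log-levels of the two pseudo-sizes are almost indistinguishable.
        double diff = level(olderSize, 10) - level(newerSize, 10);
        System.out.println("level difference: " + diff);
    }
}
```

Under these assumptions the newer segment does get the smaller size,
but the two sizes differ by only about one part in 10^11, so their
log-levels are practically identical, and the real byte size or
document count of a segment plays no role at all.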

Mike

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
<ravikumar.govindarajan@gmail.com> wrote:
> Thanks Mike,
>
> Will try your suggestion. I will try to describe the actual use-case itself
>
> There is a requirement for merging time-adjacent segments [append-only,
> rolling time-series data]
>
> All Documents have a timestamp affixed and during flush I need to record
> the smallest timestamp across all documents, through the Codec.
>
> Then, I define a TimeMergePolicy extends LogMergePolicy and define the
> segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
>
> LogMergePolicy will auto-arrange levels of segments according to time and
> proceed with merges. The latest segments will be smaller in size and
> preferred during merges over older and bigger segments
>
> Do you think such an approach will be fine, or are there better ways to
> solve this?
>
> --
> Ravi
>
>
> On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> Somewhere in those numeric trie terms are the exact integers from your
>> documents, encoded.
>>
>> You can use oal.util.NumericUtils.prefixCodedToInt to get the int
>> value back from the BytesRef term.
>>
>> But you need to filter out the "higher level" terms, e.g. using
>> NumericUtils.getPrefixCodedIntShift(term) == 0.  Or use
>> NumericUtils.filterPrefixCodedInts to wrap a TermsEnum.  I believe
>> all the terms you want come first, so once you hit a term where
>> .getPrefixCodedIntShift is > 0, the term just before it is your max
>> and you can stop checking.
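
To make the ordering argument concrete, here is a toy model in plain
Java. It is not Lucene's byte encoding: the Term class, termsFor and
minMax are hypothetical names, and each term is modeled as a
(shift, bits) pair sorted by shift first, which is the property the
advice above relies on. In a real codec the terms would be BytesRefs
decoded through NumericUtils.

```java
import java.util.ArrayList;
import java.util.List;

class TrieMinMaxSketch {
    /** Toy stand-in for one prefix-coded term (not Lucene's BytesRef layout). */
    static final class Term {
        final int shift; // 0 = full precision, > 0 = lower-precision prefix term
        final int bits;  // sortable bits of the value, shifted right by 'shift'
        Term(int shift, int bits) { this.shift = shift; this.bits = bits; }
    }

    /** Builds every trie term for the given values, in the assumed sort order. */
    static List<Term> termsFor(int[] values, int precisionStep) {
        List<Term> terms = new ArrayList<>();
        for (int v : values) {
            int sortable = v ^ 0x80000000; // flip sign bit: unsigned order == signed order
            for (int shift = 0; shift < 32; shift += precisionStep) {
                terms.add(new Term(shift, sortable >>> shift));
            }
        }
        // Assumed ordering: all shift-0 terms first, then higher shifts.
        terms.sort((a, b) -> a.shift != b.shift
                ? Integer.compare(a.shift, b.shift)
                : Integer.compareUnsigned(a.bits, b.bits));
        return terms;
    }

    static int decode(Term t) {
        return t.bits ^ 0x80000000; // only meaningful for shift-0 terms
    }

    /** Returns {min, max} by scanning full-precision terms and stopping at the first shift > 0. */
    static int[] minMax(List<Term> sortedTerms) {
        if (sortedTerms.isEmpty() || sortedTerms.get(0).shift != 0) {
            throw new IllegalArgumentException("no full-precision terms");
        }
        int min = decode(sortedTerms.get(0));
        int max = min;
        for (Term t : sortedTerms) {
            if (t.shift > 0) {
                break; // every remaining term is a lower-precision prefix
            }
            max = decode(t); // last shift-0 term seen so far
        }
        return new int[] {min, max};
    }

    public static void main(String[] args) {
        int[] mm = minMax(termsFor(new int[] {42, -7, 1000, 3}, 4));
        System.out.println("min=" + mm[0] + " max=" + mm[1]);
    }
}
```

Taking the first and last term of the whole list, as in the original
codec, would pick up a shift-0 term for the min but a high-shift
prefix term for the max, which is why the stored values looked wrong.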
>>
>> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
>> you can e.g. pull your own TermsEnum and iterate the terms yourself.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
>> <ravikumar.govindarajan@gmail.com> wrote:
>> > I use a Codec to flush data. All methods delegate to actual
>> > Lucene42Codec, except for intercepting one single-field. This field is
>> > indexed as an IntField [Numeric-Trie...], with precisionStep=4.
>> >
>> > The purpose of the Codec is as follows
>> >
>> > 1. Note the first BytesRef for this field
>> > 2. During finish() call [TermsConsumer.java], note the last BytesRef for
>> > this field
>> > 3. Convert both the first/last BytesRef to their respective integers
>> > 4. Store these 2 ints in segment-info diagnostics
>> >
>> > The problem with this approach is that the first/last BytesRef are totally
>> > different from the actual "int" values I try to index. I guess this is
>> > because Numeric-Trie explodes all the integers into its own format of
>> > BytesRefs. Hence my Codec stores the wrong values in segment-diagnostics
>> >
>> > Is there a way I can record actual min/max int-values correctly in my
>> > codec and still support NumericRange search?
>> >
>> > --
>> > Ravi
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
