lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Actual min and max-value of NumericField during codec flush
Date Wed, 12 Feb 2014 12:51:52 GMT
OK, I see (early termination).

That's a challenge, because you really want the docs sorted backwards
from how they were added, right?  And, e.g., merged and then searched
in "reverse segment order"?

I think you should be able to do this w/ SortingMergePolicy?  And then
use a custom collector that stops after you've gone back enough in
time for a given search.
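
The collector idea above can be sketched roughly as follows. This is a plain-Java model, not Lucene code: NewestFirstSearch, Segment, and topN are hypothetical names, and each segment is assumed to be non-empty and already sorted newest-first, as Ravi describes. It only illustrates stopping early within a segment and skipping segments that are too old.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java model (no Lucene types) of newest-first search with early termination. */
final class NewestFirstSearch {

    /** Hypothetical stand-in for a segment: matching doc timestamps, sorted newest-first. */
    static final class Segment {
        final long[] timestampsDesc;

        Segment(long... timestampsDesc) {
            this.timestampsDesc = timestampsDesc;
        }

        long maxTimestamp() {
            return timestampsDesc[0]; // assumes a non-empty segment
        }
    }

    /** Number of segments actually scanned by the last topN call. */
    int segmentsVisited;

    /** Returns the n newest timestamps, newest first. */
    List<Long> topN(List<Segment> segments, int n) {
        segmentsVisited = 0;
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Visit segments in "reverse segment order": newest max timestamp first.
        List<Segment> ordered = new ArrayList<>(segments);
        ordered.sort((a, b) -> Long.compare(b.maxTimestamp(), a.maxTimestamp()));

        PriorityQueue<Long> kept = new PriorityQueue<>(); // min-heap: head = oldest kept hit
        for (Segment seg : ordered) {
            if (kept.size() == n && seg.maxTimestamp() <= kept.peek()) {
                break; // this and all remaining segments are too old
            }
            segmentsVisited++;
            for (long ts : seg.timestampsDesc) {
                if (kept.size() < n) {
                    kept.add(ts);
                } else if (ts > kept.peek()) {
                    kept.poll();
                    kept.add(ts);
                } else {
                    break; // sorted segment: everything after this is older
                }
            }
        }
        List<Long> result = new ArrayList<>(kept);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

In the best case only the newest segment is scanned. If segments overlap heavily in time, as they can after out-of-order merges, more of them have to be visited before the loop can stop.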

Mike McCandless

http://blog.mikemccandless.com


On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
<ravikumar.govindarajan@gmail.com> wrote:
> Mike,
>
> All our queries need to be sorted by timestamp field, in descending order
> of time. [latest-first]
>
> Each segment is sorted in itself. But TieredMergePolicy picks arbitrary
> segments and merges them [even with SortingMergePolicy etc...]. I am trying
> to avoid this and see if an approximate global ordering of segments [by
> time-stamp field] can be maintained via merge.
>
> Ex: TopN results will only examine the 2-3 most recent, smaller segments
> [best-case] and return, without examining older and bigger segments.
>
> I do not know the terminology, maybe "Early Query Termination Across
> Segments" etc...?
>
> --
> Ravi
>
>
> On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total
>> order.
>>
>> Only TieredMergePolicy merges out-of-order segments.
>>
>> I don't understand why you need to encourage merging of the more
>> recent (by your "time" field) segments...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
>> <ravikumar.govindarajan@gmail.com> wrote:
>> > Mike,
>> >
>> > Each of my flushed segments is fully ordered by time. But
>> > TieredMergePolicy or LogByteSizeMergePolicy is going to pick arbitrary
>> > time-segments and disturb this arrangement, and I wanted some kind of
>> > control over this.
>> >
>> > But like you pointed out, relying only on time-adjacent merges can be
>> > disastrous.
>> >
>> > Is there a way to mix both time and size to arrive at a somewhat
>> > [less-than-accurate] global order of segment merges?
>> >
>> > For example, attempt a time-adjacent merge, provided the sizes of the
>> > segments are not extremely skewed, etc...
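
One way to picture the "time-adjacent, but not too skewed" rule is the sketch below. It is a plain-Java model and not a Lucene MergePolicy: TimeAdjacentMergePicker, pick, runLength, and maxSkew are hypothetical names, segment sizes are assumed to be listed in time order, and preferring the cheapest qualifying run is just one possible tie-breaker.

```java
/** Plain-Java model of choosing a run of time-adjacent segments with bounded size skew. */
final class TimeAdjacentMergePicker {

    /**
     * Returns {start, endExclusive} of the cheapest run of runLength adjacent segments
     * whose largest/smallest size ratio is at most maxSkew, or null if no run qualifies.
     */
    static int[] pick(long[] sizesInTimeOrder, int runLength, double maxSkew) {
        int[] best = null;
        long bestTotal = Long.MAX_VALUE;
        for (int start = 0; start + runLength <= sizesInTimeOrder.length; start++) {
            long min = Long.MAX_VALUE;
            long max = 0;
            long total = 0;
            for (int i = start; i < start + runLength; i++) {
                long size = sizesInTimeOrder[i];
                min = Math.min(min, size);
                max = Math.max(max, size);
                total += size;
            }
            // Reject runs where one segment dwarfs another (guard against size 0).
            if ((double) max / Math.max(1, min) > maxSkew) {
                continue;
            }
            if (total < bestTotal) {
                bestTotal = total;
                best = new int[] {start, start + runLength};
            }
        }
        return best;
    }
}
```

Because only adjacent runs are considered, a merge never interleaves distant time ranges, and the skew check keeps a tiny new segment from being merged with a huge old one.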
>> >
>> > --
>> > Ravi
>> >
>> >
>> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
>> > lucene@mikemccandless.com> wrote:
>> >
>> >> You want to focus merging on the segments containing newer documents?
>> >> Why?  This seems somewhat dangerous...
>> >>
>> >> Not taking into account the "true" segment size can lead to very very
>> >> poor merge decisions ... you should turn on IndexWriter's infoStream
>> >> and do a long running test to convince yourself the merging is being
>> >> sane.
>> >>
>> >> Mike
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >>
>> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
>> >> <ravikumar.govindarajan@gmail.com> wrote:
>> >> > Thanks Mike,
>> >> >
>> >> > Will try your suggestion. I will try to describe the actual use-case
>> >> > itself.
>> >> >
>> >> > There is a requirement for merging time-adjacent segments
>> >> > [append-only, rolling time-series data].
>> >> >
>> >> > All documents have a timestamp affixed, and during flush I need to
>> >> > note down the least timestamp across all documents, through the Codec.
>> >> >
>> >> > Then, I define a TimeMergePolicy that extends LogMergePolicy and define
>> >> > the segment-size = Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
>> >> >
>> >> > LogMergePolicy will auto-arrange levels of segments according to time
>> >> > and proceed with merges. The latest segments will be smaller in size and
>> >> > preferred during merges over older and bigger segments.
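
The arithmetic of that size proxy can be checked with a small stand-alone sketch. It assumes timestamps are epoch milliseconds and that levels are derived as log(size) / log(mergeFactor) with a level span of 0.75, which is my understanding of LogMergePolicy rather than something stated in this thread. TimePseudoSize, pseudoSize, and level are hypothetical names.

```java
/** Plain-Java check of the "Long.MAX_VALUE - least timestamp" size proxy. */
final class TimePseudoSize {

    /** Newer segments (larger least timestamp) get a smaller pseudo-size. */
    static long pseudoSize(long leastTimestampMillis) {
        return Long.MAX_VALUE - leastTimestampMillis;
    }

    /** Assumed level formula: log(size) / log(mergeFactor). */
    static double level(long size, int mergeFactor) {
        return Math.log(Math.max(1, size)) / Math.log(mergeFactor);
    }

    public static void main(String[] args) {
        long older = 1391700000000L;        // roughly early February 2014
        long newer = older + 2592000000L;   // 30 days later
        double gap = level(pseudoSize(older), 10) - level(pseudoSize(newer), 10);
        System.out.println("pseudo-size older > newer: " + (pseudoSize(older) > pseudoSize(newer)));
        System.out.println("level gap: " + gap);
    }
}
```

Under those assumptions the pseudo-sizes do order segments by time, but their logarithms are almost identical, so segments a month apart would still fall within the same level. That is worth verifying against real infoStream output before relying on the scheme.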
>> >> >
>> >> > Do you think such an approach will be fine, or are there better ways
>> >> > to solve this?
>> >> >
>> >> > --
>> >> > Ravi
>> >> >
>> >> >
>> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
>> >> > lucene@mikemccandless.com> wrote:
>> >> >
>> >> >> Somewhere in those numeric trie terms are the exact integers from
>> >> >> your documents, encoded.
>> >> >>
>> >> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the int
>> >> >> value back from the BytesRef term.
>> >> >>
>> >> >> But you need to filter out the "higher level" terms, e.g. using
>> >> >> NumericUtils.getPrefixCodedIntShift(term) == 0.  Or use
>> >> >> NumericUtils.filterPrefixCodedInts to wrap a TermsEnum.  I believe
>> >> >> all the terms you want come first, so once you hit a term where
>> >> >> .getPrefixCodedIntShift is > 0, the term just before it is your max
>> >> >> and you can stop checking.
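
To see why the full-precision terms come first and where the scan can stop, here is a simplified plain-Java model of the trie terms. It does not call NumericUtils and does not reproduce Lucene's exact byte layout: TrieTermModel, indexTerms, and minMax are hypothetical names, each term is modeled as a {shift, bits} pair sorted by shift and then by unsigned bits, and precisionStep is assumed to be positive.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of trie-encoded int terms; it does not use Lucene's NumericUtils. */
final class TrieTermModel {

    /** Expands each value into one {shift, bits} term per precision level, then sorts them. */
    static List<int[]> indexTerms(int[] values, int precisionStep) {
        List<int[]> terms = new ArrayList<>();
        for (int v : values) {
            int sortable = v ^ 0x80000000; // flip sign bit so unsigned order matches signed order
            for (int shift = 0; shift < 32; shift += precisionStep) {
                terms.add(new int[] {shift, (sortable >>> shift) << shift});
            }
        }
        // Term-dictionary order in this model: lower shift first, then unsigned bits.
        terms.sort((a, b) -> a[0] != b[0]
                ? Integer.compare(a[0], b[0])
                : Integer.compareUnsigned(a[1], b[1]));
        return terms;
    }

    /** Scans only the leading shift == 0 terms and returns {min, max}. */
    static int[] minMax(List<int[]> sortedTerms) {
        boolean seen = false;
        int min = 0;
        int max = 0;
        for (int[] term : sortedTerms) {
            if (term[0] > 0) {
                break; // first lower-precision term: the previous term was the max
            }
            int value = term[1] ^ 0x80000000; // undo the sign-bit flip
            if (!seen) {
                min = value;
                seen = true;
            }
            max = value;
        }
        if (!seen) {
            throw new IllegalArgumentException("no full-precision terms");
        }
        return new int[] {min, max};
    }
}
```

In this model the very last term has shift 28 with precisionStep=4, not shift 0, which mirrors why decoding the final BytesRef of the field yields a number unrelated to the indexed maximum.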
>> >> >>
>> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
>> >> >> you can e.g. pull your own TermsEnum and iterate the terms yourself.
>> >> >>
>> >> >> Mike McCandless
>> >> >>
>> >> >> http://blog.mikemccandless.com
>> >> >>
>> >> >>
>> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
>> >> >> <ravikumar.govindarajan@gmail.com> wrote:
>> >> >> > I use a Codec to flush data. All methods delegate to the actual
>> >> >> > Lucene42Codec, except for intercepting one single field. This field
>> >> >> > is indexed as an IntField [Numeric-Trie...], with precisionStep=4.
>> >> >> >
>> >> >> > The purpose of the Codec is as follows:
>> >> >> >
>> >> >> > 1. Note the first BytesRef for this field
>> >> >> > 2. During the finish() call [TermsConsumer.java], note the last
>> >> >> >    BytesRef for this field
>> >> >> > 3. Convert both the first/last BytesRef to their respective integers
>> >> >> > 4. Store these 2 ints in segment-info diagnostics
>> >> >> > The problem with this approach is that the first/last BytesRefs are
>> >> >> > totally different from the actual "int" values I try to index. I
>> >> >> > guess this is because Numeric-Trie explodes all the integers into
>> >> >> > its own format of BytesRefs. Hence my Codec stores the wrong values
>> >> >> > in segment-diagnostics.
>> >> >> >
>> >> >> > Is there a way I can record the actual min/max int values correctly
>> >> >> > in my codec and still support NumericRange search?
>> >> >> >
>> >> >> > --
>> >> >> > Ravi
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >> >>
>> >> >>
>> >>
>> >>
>> >>
>>
>>
>>


