From: Michael McCandless <lucene@mikemccandless.com>
Date: Wed, 12 Feb 2014 09:59:05 -0500
Subject: Re: Actual min and max-value of NumericField during codec flush
To: Lucene Users <java-user@lucene.apache.org>

Right, I think you'll need to use one of the LogXMergePolicy variants (or
subclass LogMergePolicy and make your own): they always pick adjacent
segments to merge. SortingMP lets you pass in the MP to wrap, so just pass
in a LogXMP, and then sort by timestamp?

Mike McCandless
http://blog.mikemccandless.com


On Wed, Feb 12, 2014 at 8:16 AM, Ravikumar Govindarajan wrote:
> Yes, exactly as you have described.
>
> Ex: Consider segments [S1, S2, S3 & S4] in reverse-chronological order,
> about to go through a merge.
>
> While SortingMergePolicy will correctly solve the merge part, it does not,
> however, play any role in picking the segments to merge, right?
>
> SMP internally delegates to TieredMergePolicy, which might pick S1 & S4 to
> merge, disturbing the global order. Ideally only "adjacent" segments
> should be picked up for a merge. Ex: {S1, S2} or {S2, S3, S4} etc...
>
> Can there be a better selection of segments to merge in this case, so as
> to maintain a semblance of global ordering?
>
> --
> Ravi
>
> On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> OK, I see (early termination).
>>
>> That's a challenge, because you really want the docs sorted backwards
>> from how they were added, right? And, e.g., merged and then searched
>> in "reverse segment order"?
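Mike's suggestion above (wrap a log merge policy in SortingMP and sort by timestamp) might be wired up roughly like this against the Lucene 4.x-era API. This is only a configuration sketch: SortingMergePolicy lived in the misc module (org.apache.lucene.index.sorter) at the time, and the "timestamp" field name plus the version/analyzer variables are assumptions, not from the thread:

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.sorter.SortingMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// LogByteSizeMergePolicy only ever merges adjacent segments;
// SortingMergePolicy wraps it so each merged segment comes out sorted
// by timestamp, newest first ("timestamp" is a hypothetical long field).
Sort byTimeDesc = new Sort(new SortField("timestamp", SortField.Type.LONG, true));
IndexWriterConfig iwc = new IndexWriterConfig(version, analyzer); // version/analyzer assumed in scope
iwc.setMergePolicy(new SortingMergePolicy(new LogByteSizeMergePolicy(), byTimeDesc));
```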
>>
>> I think you should be able to do this w/ SortingMergePolicy? And then
>> use a custom collector that stops after you've gone back far enough in
>> time for a given search.
>>
>> Mike McCandless
>> http://blog.mikemccandless.com
>>
>> On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan wrote:
>> > Mike,
>> >
>> > All our queries need to be sorted by the timestamp field, in
>> > descending order of time. [latest-first]
>> >
>> > Each segment is sorted in itself. But TieredMergePolicy picks
>> > arbitrary segments and merges them [even with SortingMergePolicy
>> > etc...]. I am trying to avoid this and see if an approximate global
>> > ordering of segments [by timestamp field] can be maintained via
>> > merges.
>> >
>> > Ex: TopN results will only examine the 2-3 most recent, smaller
>> > segments [best-case] and return, without examining older and bigger
>> > segments.
>> >
>> > I do not know the terminology; maybe "Early Query Termination Across
>> > Segments" etc...?
>> >
>> > --
>> > Ravi
>> >
>> > On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
>> > lucene@mikemccandless.com> wrote:
>> >
>> >> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the
>> >> total order.
>> >>
>> >> Only TieredMergePolicy merges out-of-order segments.
>> >>
>> >> I don't understand why you need to encourage merging of the more
>> >> recent (by your "time" field) segments...
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan wrote:
>> >> > Mike,
>> >> >
>> >> > Each of my flushed segments is fully ordered by time. But
>> >> > TieredMergePolicy or LogByteSizeMergePolicy is going to pick
>> >> > arbitrary time-segments and disturb this arrangement, and I wanted
>> >> > some kind of control over this.
>> >> >
>> >> > But like you pointed out, going by only time-adjacent merges can
>> >> > be disastrous.
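The "custom collector that stops early" idea above can be modeled in plain Java. This is a self-contained toy, not the Lucene Collector API: if segments are visited newest-first and each is internally sorted latest-first, collection can stop as soon as topN hits are gathered, never touching the older segments.

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyTermination {
    // segmentsNewestFirst: each inner list holds timestamps sorted
    // descending; the outer list is ordered newest segment first.
    // Returns as soon as topN hits are collected -- the analogue of a
    // collector aborting the search once it has gone back far enough.
    static List<Long> topNLatest(List<List<Long>> segmentsNewestFirst, int topN) {
        List<Long> hits = new ArrayList<>();
        for (List<Long> segment : segmentsNewestFirst) {
            for (long timestamp : segment) {
                hits.add(timestamp);
                if (hits.size() == topN) {
                    return hits; // early termination: older segments never read
                }
            }
        }
        return hits;
    }
}
```

Note that this only returns the globally newest hits if the segments really are in reverse-chronological order, which is exactly why the thread worries about the merge policy disturbing that order.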
>> >> >
>> >> > Is there a way to mix both time and size to arrive at a somewhat
>> >> > [less-than-accurate] global order of segment merges?
>> >> >
>> >> > Like attempt a time-adjacent merge, provided the sizes of the
>> >> > segments are not extremely skewed etc...
>> >> >
>> >> > --
>> >> > Ravi
>> >> >
>> >> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
>> >> > lucene@mikemccandless.com> wrote:
>> >> >
>> >> >> You want to focus merging on the segments containing newer
>> >> >> documents? Why? This seems somewhat dangerous...
>> >> >>
>> >> >> Not taking into account the "true" segment size can lead to very,
>> >> >> very poor merge decisions... you should turn on IndexWriter's
>> >> >> infoStream and do a long-running test to convince yourself the
>> >> >> merging is being sane.
>> >> >>
>> >> >> Mike
>> >> >>
>> >> >> Mike McCandless
>> >> >>
>> >> >> http://blog.mikemccandless.com
>> >> >>
>> >> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan wrote:
>> >> >> > Thanks Mike,
>> >> >> >
>> >> >> > Will try your suggestion. I will try to describe the actual
>> >> >> > use-case itself.
>> >> >> >
>> >> >> > There is a requirement for merging time-adjacent segments
>> >> >> > [append-only, rolling time-series data].
>> >> >> >
>> >> >> > All documents have a timestamp affixed, and during flush I need
>> >> >> > to note down the least timestamp across all documents, through
>> >> >> > a Codec.
>> >> >> >
>> >> >> > Then, I define a TimeMergePolicy extends LogMergePolicy and
>> >> >> > define segment-size = Long.MAX_VALUE - SEG_LEAST_TIME
>> >> >> > [segment-diag].
>> >> >> >
>> >> >> > LogMergePolicy will auto-arrange levels of segments according
>> >> >> > to time and proceed with merges.
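The size trick described above can be checked in isolation (a sketch of the idea only, not the actual LogMergePolicy subclass): reporting Long.MAX_VALUE minus the segment's least timestamp as its "size" makes newer segments look smaller, so a policy that arranges segments into levels by size ends up arranging them by recency.

```java
public class TimeAsSize {
    // Pseudo-size from the thread: a newer segment (larger least
    // timestamp) reports a smaller size, so size-ordered merge
    // selection becomes time-ordered selection.
    static long pseudoSize(long leastTimestamp) {
        return Long.MAX_VALUE - leastTimestamp;
    }
}
```

Mike's warning still applies: this throws away the true byte size entirely, so a huge old segment and a tiny old segment look alike to the policy, which can produce badly skewed merges.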
>> >> >> > Latest segments will be lesser in size and preferred during
>> >> >> > merges over older and bigger segments.
>> >> >> >
>> >> >> > Do you think such an approach will be fine, or are there better
>> >> >> > ways to solve this?
>> >> >> >
>> >> >> > --
>> >> >> > Ravi
>> >> >> >
>> >> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
>> >> >> > lucene@mikemccandless.com> wrote:
>> >> >> >
>> >> >> >> Somewhere in those numeric trie terms are the exact integers
>> >> >> >> from your documents, encoded.
>> >> >> >>
>> >> >> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the
>> >> >> >> int value back from the BytesRef term.
>> >> >> >>
>> >> >> >> But you need to filter out the "higher level" terms, e.g. using
>> >> >> >> NumericUtils.getPrefixCodedLongShift(term) == 0. Or use
>> >> >> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum. I
>> >> >> >> believe all the terms you want come first, so once you hit a
>> >> >> >> term where .getPrefixCodedLongShift is > 0, that's your max
>> >> >> >> term and you can stop checking.
>> >> >> >>
>> >> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so
>> >> >> >> that you can e.g. pull your own TermsEnum and iterate the
>> >> >> >> terms yourself.
>> >> >> >>
>> >> >> >> Mike McCandless
>> >> >> >>
>> >> >> >> http://blog.mikemccandless.com
>> >> >> >>
>> >> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan wrote:
>> >> >> >> > I use a Codec to flush data. All methods delegate to the
>> >> >> >> > actual Lucene42Codec, except for intercepting one single
>> >> >> >> > field. This field is indexed as an IntField [numeric trie],
>> >> >> >> > with precisionStep=4.
>> >> >> >> >
>> >> >> >> > The purpose of the Codec is as follows:
>> >> >> >> >
>> >> >> >> > 1. Note the first BytesRef for this field
>> >> >> >> > 2. During the finish() call [TermsConsumer.java], note the
>> >> >> >> >    last BytesRef for this field
>> >> >> >> > 3. Convert both the first/last BytesRef to their respective
>> >> >> >> >    integers
>> >> >> >> > 4. Store these 2 ints in the segment-info diagnostics
>> >> >> >> >
>> >> >> >> > The problem with this approach is that the first/last
>> >> >> >> > BytesRef is totally different from the actual "int" values I
>> >> >> >> > try to index. I guess this is because the numeric trie
>> >> >> >> > explodes all the integers into its own format of BytesRefs.
>> >> >> >> > Hence my Codec stores the wrong values in the segment
>> >> >> >> > diagnostics.
>> >> >> >> >
>> >> >> >> > Is there a way I can record the actual min/max int-values
>> >> >> >> > correctly in my codec and still support NumericRange search?
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Ravi

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
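On the NumericUtils question in the thread: the full-precision trie terms carry shift == 0 and sort before the coarser shift > 0 "higher level" terms, so the min is the first full-precision term and the max is the last one seen before the shift turns positive. A self-contained model of that filtering follows; the (shift, value) pairs stand in for decoded trie terms in term-dictionary order, and this is an illustration of the logic, not the NumericUtils API itself:

```java
import java.util.List;

public class TrieMinMax {
    // termsInOrder models trie terms in term-dictionary order as
    // {shift, value} pairs: shift == 0 terms are full precision and come
    // first; shift > 0 terms are the coarser "higher level" terms.
    static int[] minMax(List<int[]> termsInOrder) {
        Integer min = null, max = null;
        for (int[] term : termsInOrder) {
            if (term[0] > 0) {
                break; // first higher-level term: full-precision terms are done
            }
            if (min == null) {
                min = term[1]; // first shift-0 term is the minimum
            }
            max = term[1];     // last shift-0 term seen so far is the maximum
        }
        return new int[] { min, max };
    }
}
```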