From: Michael McCandless <lucene@mikemccandless.com>
Date: Wed, 12 Feb 2014 07:51:52 -0500
Subject: Re: Actual min and max-value of NumericField during codec flush
To: Lucene Users <java-user@lucene.apache.org>

OK, I see (early termination). That's a challenge, because you really
want the docs sorted backwards from how they were added, right? And,
e.g., merged and then searched in "reverse segment order"?

I think you should be able to do this with SortingMergePolicy, and then
use a custom collector that stops once you've gone back far enough in
time for a given search.

Mike McCandless
http://blog.mikemccandless.com
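For illustration, a rough sketch of such a collector against the Lucene
4.x Collector API. It is only a sketch: it assumes each segment is
internally sorted newest-first (e.g. via SortingMergePolicy from the
misc module, sorting on the timestamp descending), and the
NumericDocValues field name "timestamp" and the cutoff are made-up
example values. It relies on IndexSearcher catching
CollectionTerminatedException and moving on to the next segment:

import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Stops collecting a segment once docs become older than the cutoff.
public class TimeCutoffCollector extends Collector {

  private final Collector in;          // e.g. a TopFieldCollector
  private final long cutoffTimestamp;  // oldest timestamp we care about
  private NumericDocValues timestamps;

  public TimeCutoffCollector(Collector in, long cutoffTimestamp) {
    this.in = in;
    this.cutoffTimestamp = cutoffTimestamp;
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    in.setScorer(scorer);
  }

  @Override
  public void collect(int doc) throws IOException {
    // The segment is sorted newest-first, so every doc after this one
    // is older still: tell IndexSearcher to move to the next segment.
    if (timestamps != null && timestamps.get(doc) < cutoffTimestamp) {
      throw new CollectionTerminatedException();
    }
    in.collect(doc);
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    timestamps = context.reader().getNumericDocValues("timestamp");
    in.setNextReader(context);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return false; // the cutoff check depends on in-order collection
  }
}

In practice you would wrap something like a TopFieldCollector, so the
usual top-N bookkeeping stays untouched; the wrapper only decides when a
segment has nothing recent enough left to offer.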
On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan wrote:
> Mike,
>
> All our queries need to be sorted by the timestamp field, in
> descending order of time [latest first].
>
> Each segment is sorted in itself, but TieredMergePolicy picks
> arbitrary segments and merges them [even with SortingMergePolicy
> etc.]. I am trying to avoid this and see if an approximate global
> ordering of segments [by timestamp field] can be maintained via
> merging.
>
> Ex: a top-N query would examine only the 2-3 most recent, smaller
> segments [best case] and return, without examining older and bigger
> segments.
>
> I do not know the terminology; maybe "early query termination across
> segments"?
>
> --
> Ravi
>
> On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>
>> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the
>> total order.
>>
>> Only TieredMergePolicy merges out-of-order segments.
>>
>> I don't understand why you want to encourage merging of the more
>> recent (by your "time" field) segments...
>>
>> Mike McCandless
>> http://blog.mikemccandless.com
>>
>> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan wrote:
>> > Mike,
>> >
>> > Each of my flushed segments is fully ordered by time, but
>> > TieredMergePolicy or LogByteSizeMergePolicy is going to pick
>> > arbitrary time-segments and disturb this arrangement, and I wanted
>> > some kind of control over this.
>> >
>> > But like you pointed out, going only by time-adjacent merges can
>> > be disastrous.
>> >
>> > Is there a way to mix both time and size to arrive at a somewhat
>> > [less-than-accurate] global order of segment merges? For example,
>> > attempt a time-adjacent merge provided the segment sizes are not
>> > extremely skewed.
>> >
>> > --
>> > Ravi
>> >
>> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless
>> > <lucene@mikemccandless.com> wrote:
>> >
>> >> You want to focus merging on the segments containing newer
>> >> documents? Why? This seems somewhat dangerous...
>> >>
>> >> Not taking into account the "true" segment size can lead to very,
>> >> very poor merge decisions... you should turn on IndexWriter's
>> >> infoStream and do a long-running test to convince yourself the
>> >> merging is sane.
>> >>
>> >> Mike McCandless
>> >> http://blog.mikemccandless.com
>> >>
>> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan wrote:
>> >> > Thanks Mike,
>> >> >
>> >> > Will try your suggestion. Let me describe the actual use case
>> >> > itself: there is a requirement for merging time-adjacent
>> >> > segments [append-only, rolling time-series data].
>> >> >
>> >> > All documents have a timestamp affixed, and during flush I need
>> >> > to note down the least timestamp across all documents, through
>> >> > a Codec.
>> >> >
>> >> > Then I define a TimeMergePolicy extends LogMergePolicy and
>> >> > define segment-size = Long.MAX_VALUE - SEG_LEAST_TIME [stored
>> >> > in the segment diagnostics].
>> >> >
>> >> > LogMergePolicy will auto-arrange levels of segments according
>> >> > to time and proceed with merges. The latest segments will be
>> >> > smaller in "size" and preferred during merges over older and
>> >> > bigger segments.
>> >> >
>> >> > Do you think such an approach will be fine, or are there better
>> >> > ways to solve this?
>> >> >
>> >> > --
>> >> > Ravi
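For reference, a rough sketch of that TimeMergePolicy idea, assuming a
recent 4.x API where the per-segment class is SegmentCommitInfo. The
"leastTime" diagnostics key is hypothetical, standing in for whatever
key the codec actually writes; and as Mike warns above, ignoring the
true byte size can lead to very poor merge decisions, so treat this as
a starting point only:

import java.io.IOException;
import org.apache.lucene.index.LogMergePolicy;
import org.apache.lucene.index.SegmentCommitInfo;

// Levels segments by recency instead of byte size: newer segments get
// smaller "sizes", so LogMergePolicy groups and merges them first.
public class TimeMergePolicy extends LogMergePolicy {

  @Override
  protected long size(SegmentCommitInfo info) throws IOException {
    // "leastTime" is a hypothetical diagnostics key holding the
    // segment's minimum timestamp, recorded at flush time.
    String leastTime = info.info.getDiagnostics().get("leastTime");
    if (leastTime == null) {
      // No timestamp recorded: fall back to plain byte size.
      return info.sizeInBytes();
    }
    return Long.MAX_VALUE - Long.parseLong(leastTime);
  }
}

A real policy would also want to set minMergeSize/maxMergeSize the way
LogByteSizeMergePolicy's constructor does.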
>> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless
>> >> > <lucene@mikemccandless.com> wrote:
>> >> >
>> >> >> Somewhere in those numeric trie terms are the exact integers
>> >> >> from your documents, encoded.
>> >> >>
>> >> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the
>> >> >> int value back from the BytesRef term.
>> >> >>
>> >> >> But you need to filter out the "higher level" terms, e.g. by
>> >> >> checking NumericUtils.getPrefixCodedLongShift(term) == 0, or
>> >> >> by using NumericUtils.filterPrefixCodedLongs to wrap a
>> >> >> TermsEnum. I believe all the terms you want come first, so
>> >> >> once you hit a term whose shift is > 0, you have already seen
>> >> >> your max term and you can stop checking.
>> >> >>
>> >> >> BTW, in 5.0 the codec API for PostingsFormat has improved, so
>> >> >> that you can e.g. pull your own TermsEnum and iterate the
>> >> >> terms yourself.
>> >> >>
>> >> >> Mike McCandless
>> >> >> http://blog.mikemccandless.com
>> >> >>
>> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan wrote:
>> >> >> > I use a Codec to flush data. All methods delegate to the
>> >> >> > actual Lucene42Codec, except for intercepting one single
>> >> >> > field. This field is indexed as an IntField [numeric trie],
>> >> >> > with precisionStep=4.
>> >> >> >
>> >> >> > The purpose of the Codec is as follows:
>> >> >> >
>> >> >> > 1. Note the first BytesRef for this field.
>> >> >> > 2. During the finish() call [TermsConsumer.java], note the
>> >> >> >    last BytesRef for this field.
>> >> >> > 3. Convert both the first and last BytesRef to their
>> >> >> >    respective integers.
>> >> >> > 4. Store these 2 ints in the segment-info diagnostics.
>> >> >> >
>> >> >> > The problem with this approach is that the first/last
>> >> >> > BytesRef is totally different from the actual "int" values I
>> >> >> > try to index. I guess this is because the numeric trie
>> >> >> > explodes all the integers into its own format of BytesRefs.
>> >> >> > Hence my Codec stores the wrong values in the segment
>> >> >> > diagnostics.
>> >> >> >
>> >> >> > Is there a way I can record the actual min/max int values
>> >> >> > correctly in my Codec and still support NumericRange search?
>> >> >> >
>> >> >> > --
>> >> >> > Ravi
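Putting those pointers together, a rough sketch (4.x API) of decoding
the real min/max from the trie terms on the reader side; the same
shift-based filtering applies to the BytesRefs a TermsConsumer sees at
flush, since terms arrive in sorted order there as well. Because the
field is an IntField, the int variants of the NumericUtils helpers are
used; the class name TrieMinMax is made up for the example:

import java.io.IOException;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.NumericUtils;

public class TrieMinMax {

  // Returns {min, max} for an IntField, or null if the field is absent.
  public static int[] minMax(AtomicReader reader, String field)
      throws IOException {
    Terms terms = reader.terms(field);
    if (terms == null) {
      return null;
    }
    TermsEnum te = terms.iterator(null);
    Integer min = null;
    int max = 0;
    for (BytesRef term = te.next(); term != null; term = te.next()) {
      // Full-precision terms (shift == 0) sort before the
      // higher-level trie terms, so stop at the first shift > 0.
      if (NumericUtils.getPrefixCodedIntShift(term) > 0) {
        break;
      }
      int value = NumericUtils.prefixCodedToInt(term);
      if (min == null) {
        min = value; // first full-precision term is the minimum
      }
      max = value;   // last full-precision term seen is the maximum
    }
    return min == null ? null : new int[] { min, max };
  }
}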