Subject: Re: Flex indexing : Hybrid index maintenance for faster indexing
From: Michael McCandless
To: dev@lucene.apache.org
Date: Tue, 5 Oct 2010 06:20:33 -0400

Nice paper! It's a neat trick to index the large postings as separate files, i.e. let the filesystem handle the growth as new postings are appended over time.

But, unfortunately, we can't easily do this in Lucene, since Lucene assumes index files are write-once and derives its transactional semantics from this approach. I.e., this would require sizable changes, beyond just swapping in a different Codec.

Still, the idea that small/big postings lists should be handled differently is something we can take advantage of in a Codec, and I think we should. I think we will likely switch to a default codec that uses pulsing (storing a term's postings directly in the terms dict) for very low-freq terms, maybe vInt for medium-freq terms, and FOR/PFOR for high-freq terms.

Mike

On Mon, Oct 4, 2010 at 6:42 PM, Burton-West, Tom wrote:
> Hi all,
>
> Would it be possible to implement something like this in Flex?
>
> Büttcher, S., & Clarke, C. L. A. (2008). Hybrid index maintenance for contiguous inverted lists. Information Retrieval, 11(3), 175-207. doi:10.1007/s10791-007-9042-8
>
> The approach takes advantage of having a different policy for large postings lists (i.e. frequent terms) versus small postings lists for flushing the buffer and writing to disk.
>
> Tom Burton-West
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
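
To make the tiered idea in the reply above concrete (pulsing for very low-freq terms, vInt for medium-freq terms, FOR/PFOR for high-freq terms), here is a rough Python sketch. It is not Lucene code and does not use any Lucene API: the function names and both thresholds are hypothetical, since the thread gives no cutoffs. The vInt routine follows the usual scheme of 7 payload bits per byte, low-order bits first, with the high bit marking a continuation.

```python
from typing import List

# Hypothetical cutoffs, chosen only for illustration; the thread names no numbers.
PULSING_MAX_DOC_FREQ = 1
BLOCK_MIN_DOC_FREQ = 128


def choose_encoding(doc_freq: int) -> str:
    """Pick a postings encoding tier from a term's document frequency."""
    if doc_freq <= PULSING_MAX_DOC_FREQ:
        return "pulsing"  # postings stored inline in the terms dictionary
    if doc_freq < BLOCK_MIN_DOC_FREQ:
        return "vint"     # variable-length ints, one value at a time
    return "for"          # fixed-width bit-packed blocks (FOR/PFOR family)


def to_deltas(doc_ids: List[int]) -> List[int]:
    """Turn sorted doc IDs into gaps, which are smaller and compress better."""
    deltas = []
    previous = 0
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas


def encode_vint(value: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, low-order bits first."""
    if value < 0:
        raise ValueError("vInt encodes non-negative values only")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_vints(data: bytes) -> List[int]:
    """Decode a concatenation of vInt-encoded values."""
    values = []
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(value)
            value = 0
            shift = 0
    return values


def for_bits_per_value(deltas: List[int]) -> int:
    """Bit width a frame-of-reference block needs to hold every delta."""
    return max((d.bit_length() for d in deltas), default=0)
```

A term seen in a single document would take the pulsing path, a term with a few dozen postings would be delta-coded and written with encode_vint, and a very frequent term would be cut into blocks whose width comes from for_bits_per_value. PFOR's exception handling for outliers is left out of this sketch.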