From: "Michael McCandless"
To: java-dev@lucene.apache.org
Date: Tue, 29 Apr 2008 09:23:28 -0400
Subject: Re: Flexible indexing design

Marvin Humphrey wrote:

> > Container is only aware of the single inStream, while codec can
> > still think it's operating on 3 even if it's really 1 or 2.
>
> I don't understand.  If you have three streams, all of them are going
> to have to get skipped, right?

For the "all data in one stream" (KS's dev branch) approach, I'm
picturing (and I'm not really "sure" on any of this until we get our
feet real wet here) that the container is told "you really only have 1
stream" even though the codec thinks it's got 3 streams.  And the
container holds onto that stream.

I agree container alone interacts with skip data.  But I also thought
container alone would do "big skips", and then use the codec for
doc-by-doc skipping only once it's close to the target.

If instead you do the "one file per type" approach (current released
KS & Lucene), then the container would know it has 3 real streams and
would seek all 3 on a big skip.

> > And even so one could plug in their own (single stream) codec if
> > need be?
>
> Sure.
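
To make sure we're picturing the same division of labor, here's a very
rough Java sketch of what I mean -- none of these classes exist yet
(PostingCodec, SkipData, readRecord are all made-up names), so treat it
as a strawman rather than a design:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    /** Made-up codec interface: it holds whatever streams it thinks it
     *  has (3 logically, maybe only 1 physically) and decodes one doc's
     *  record (doc delta, freq, positions, ...) per call. */
    interface PostingCodec {
      /** Decode the next doc's data and return its doc number. */
      int readRecord() throws IOException;
    }

    /** Made-up reader for the skip data, owned by the container. */
    interface SkipData {
      int skipTo(int target) throws IOException;  // last skip point <= target
      long filePointer();                         // file offset of that point
    }

    /** Container: told it has the single real stream, plus the skip data.
     *  It shares the same underlying IndexInput with the codec, so
     *  seeking it repositions the codec's "streams" too. */
    class PostingContainer {
      private final IndexInput in;       // the one physical stream
      private final PostingCodec codec;  // believes it's reading 3 streams
      private final SkipData skipper;
      private int curDoc = -1;

      PostingContainer(IndexInput in, PostingCodec codec, SkipData skipper) {
        this.in = in;
        this.codec = codec;
        this.skipper = skipper;
      }

      int skipTo(int target) throws IOException {
        // Container alone consults the skip data and makes the "big skip"...
        int landed = skipper.skipTo(target);
        if (landed > curDoc) {
          in.seek(skipper.filePointer());  // reposition the one real stream
          curDoc = landed;
          // (a real codec would presumably need its internal state reset here)
        }
        // ...then the codec finishes doc-by-doc once we're close.
        while (curDoc < target) {
          curDoc = codec.readRecord();
        }
        return curDoc;
      }
    }

The point is just that the codec never touches the skip data and never
finds out how many physical files sit behind the stream(s) it's reading.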
> The question is how we set up PostingList's iterator.  In KS right
> now, SegPList_Next() calls Posting's Read_Record() method, which takes
> an InStream as an argument -- Posting doesn't maintain any streams
> internally.
>
> As soon as the codec starts reading from an indeterminate number of
> streams, though, having the container pass them in for each read op
> won't work any more.  The codec will have to be responsible for all
> data streams.

Agreed.

> > > The only penalty for having a TermPositions object read positions
> > > in bulk like that is memory footprint (since each TermPositions
> > > object needs a growable array of 32-bit integers to store the bulk
> > > positions).
> >
> > This is a tiny memory cost right?  A TermPositions instance gets
> > re-used each time you next() to the next term.
>
> With large documents, the buffer size requirements can get pretty big
> -- consider how often the term "the" might appear in a 1000-page
> novel.  Lucene and its relatives don't work very well with novel-sized
> documents anyway, though, so for better or worse, I blew it off.

OK, good point.  There are users now and then, for better or worse, who
do seem to index massive documents.

> > I think you also pay an added up-front (small) decoding cost in
> > cases where the consumer will only look at a subset of the positions
> > before next()'ing.  Eg a phrase search involving a rare term and a
> > frequent term.
>
> Yes, you're right about that.  The KS phrase scorer has less function
> call overhead, though -- it's a more limited design, with no 'slop'
> support, and it operates using pointer math on the arrays of positions
> directly rather than having to make a lot of accessor calls.
>
> My guess is that it's a wash, algorithm-wise.  It seems likely that
> file format would have a significant effect, but I would be surprised
> to see phrase scorer performance in KS and Lucene diverge all that
> much as a result of the algorithmic implementation.
>
> That's why I asserted that the main motivation for bulk-reading
> positions in KS was simplicity, rather than performance or something
> else.

OK, that seems like a likely hypothesis, though somehow we should
verify it in practice.  Performance of phrase searching (and its
relatives) is important.

> > But the good news is if this framework is pluggable, one can insert
> > their own codec to not do the up-front decoding of all positions per
> > term X doc.
>
> Yes, that would work.  You would need different Scorer implementations
> to deal with codecs which read the same file format yet are
> implemented differently... but that's a feature.  :)

Yeah.  This is very much a way-expert use case.

> > > We'd need both PostingBuffer and Posting subclasses.
> >
> > Yes.
>
> OK, I think we're almost at the point where we can hack up prototypes
> for Lucene implementations of PostingBuffer and PostingList that read
> the current file format.  I think seeing those would clarify things.
>
> Ironically, I'm not sure exactly how Posting fits into the equation at
> read-time anymore, but I think that will work itself out as we get
> back to indexing.

Agreed, it's time to get our feet wet.  Though I'm working "top down",
so my first focus is on modularizing DocumentsWriter such that new
features like LUCENE-1231 (column-stride fields) are more or less
"just" a plugin into indexing.
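
Going back to the bulk-vs-lazy positions question above, here's roughly
the contrast I have in mind -- again just a sketch, the class names are
invented and it glosses over payloads and the real prox encoding:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    /** KS-style bulk decode: read every position for the current
     *  term X doc into a growable int[] up front.  Simple, and the
     *  scorer can then do plain array math, but the buffer grows with
     *  the most position-dense doc seen, and you pay the full decode
     *  even when the consumer only looks at the first few positions. */
    class BulkPositions {
      private int[] positions = new int[8];
      private int count;

      void read(IndexInput proxIn, int freq) throws IOException {
        if (positions.length < freq) {
          positions = new int[freq];          // grow to fit this doc
        }
        int pos = 0;
        for (int i = 0; i < freq; i++) {
          pos += proxIn.readVInt();           // delta-coded positions
          positions[i] = pos;
        }
        count = freq;
      }

      int[] positions() { return positions; } // valid in [0, count)
      int count() { return count; }
    }

    /** The lazy alternative one could plug in instead: decode one
     *  position per call, so a consumer that only looks at the first
     *  few positions avoids decoding the rest up front (they still
     *  have to be read past before advancing to the next doc).
     *  Caller must not call next() more than freq times per doc. */
    class LazyPositions {
      private int pos;

      void reset() { pos = 0; }

      int next(IndexInput proxIn) throws IOException {
        pos += proxIn.readVInt();
        return pos;
      }
    }

A phrase scorer written against the bulk flavor gets the pointer-math
style Marvin described; one written against the lazy flavor skips the
up-front decode but, as Marvin says, needs its own Scorer
implementation.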
Mike