From: "Michael McCandless"
To: java-dev@lucene.apache.org
Date: Thu, 10 Apr 2008 06:10:54 -0400
Subject: Re: Flexible indexing design
Message-ID: <9ac0c6aa0804100310m251f2b09u93fb54a64b98d437@mail.gmail.com>
Marvin Humphrey wrote:

> On Apr 9, 2008, at 6:35 AM, Michael Busch wrote:
>
>> We also need to come up with a good solution for the dictionary,
>> because a term with frq/prx postings needs to store two (or three for
>> skiplist) file pointers in the dictionary, whereas e.g. a "binary"
>> posting list only needs one pointer.
>
> This is something I'm working on as well, and I hope we can solve a
> couple of design problems I've been turning over in my mind for some
> time.
>
> In KS, the information Lucene stores in the frq/prx files is carried
> in one postings file per field, as discussed previously. However, I
> made the additional change of breaking out skip data into a separate
> file (shared across all fields). Isolating skip data sacrifices some
> locality of reference, but buys substantial gains in simplicity and
> compartmentalization. Individual Posting subclasses, each of which
> defines a file format, don't have to know about skip algorithms at
> all. :) Further, improvements in the skip algorithm only require
> changes to the .skip file, and falling back to PostingList_Next still
> works if the .skip file becomes corrupted, since .skip carries only
> optimization info and no real data.

Can't you compartmentalize while still serializing skip data into the
single frq/prx file?

This is analogous to how videos are encoded. E.g., the AVI file format
is a "container" format: it contains packets of video and packets of
audio, interleaved at the right rate so a player can play both in sync.
The container has no idea how to decode the audio and video packets;
separate codecs do that.

Taking this back to Lucene, there's a container format that, using
TermInfo, knows where the frq/prx data (packet) is and where the skip
data (packet) is.
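As a rough sketch of that container idea (every class and method name
below is invented for illustration; none of this is a real Lucene API):

```java
// Invented sketch: a "container" that, like AVI, knows where packets
// live but delegates their interpretation to pluggable decoders.
import java.util.HashMap;
import java.util.Map;

interface PacketDecoder {
    String decode(byte[] packet); // interpret one packet's raw bytes
}

public class PacketContainer {
    // The container maps a packet kind to its decoder; it never looks
    // inside the bytes itself.
    private final Map<String, PacketDecoder> decoders = new HashMap<>();

    public void register(String kind, PacketDecoder d) {
        decoders.put(kind, d);
    }

    public String read(String kind, byte[] packet) {
        return decoders.get(kind).decode(packet);
    }

    public static void main(String[] args) {
        PacketContainer c = new PacketContainer();
        // frq/prx and skip decoders are registered independently, so
        // moving their packets between one file and three files changes
        // neither decoder.
        c.register("frqprx", p -> "frq/prx packet of " + p.length + " bytes");
        c.register("skip", p -> "skip packet of " + p.length + " bytes");
        System.out.println(c.read("frqprx", new byte[16]));
        System.out.println(c.read("skip", new byte[8]));
    }
}
```

The point is only that the container tracks packet boundaries while
each packet's encoding lives behind its own decoder.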
And it calls on separate decoders to decode each. This way we can
decouple the question of "how many files do I store my things in" from
"how is each thing encoded/decoded". Maybe I want frq/prx/skip all in
one file, or maybe I want them in 3 different files.

> For reasons I won't go into here, KS doesn't need to put a field
> number in its TermInfo, but it does need doc freq, plus file positions
> for the postings file, the skip file, and the primary Lexicon file.
> (Lexicon is the KS term dictionary class, akin to Lucene's TermEnum.)
>
>     struct kino_TermInfo {
>         kino_VirtualTable* _;
>         kino_ref_t         ref;
>         chy_i32_t          doc_freq;
>         chy_u64_t          post_filepos;
>         chy_u64_t          skip_filepos;
>         chy_u64_t          lex_filepos;
>     };
>
> There are two problems.
>
> First is that I'd like to extend indexing with arbitrary subclasses of
> SegDataWriter, and I'd like these classes to be able to put their own
> file position bookmarks (or possibly other data) into TermInfo. Making
> TermInfo hash-based would probably do it, but there would be nasty
> performance and memory penalties, since TermInfo objects are numerous.
>
> So, what's the best way to allow multiple, unrelated classes to extend
> TermInfo and the term dictionary file format? Is it to break up
> TermInfo information horizontally rather than vertically, so that
> instead of a single array of TermInfo objects, we have a flexible
> stack of arrays of 64-bit integers representing file positions?

I think for starters TermInfo must encode N absolute offsets into
separate files, and/or relative offsets into the same file, or some
combination. If we want to go even further, such that separate plugins
can insert arbitrary fields into the TermInfo, I'm not sure how best to
do that... memory cost is important for TermInfo since so many of these
are created.

> The second problem is how to share a term dictionary over a cluster.
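Returning to the first problem for a moment: the "flexible stack of
arrays of 64-bit integers" might look something like the sketch below.
All names here are invented, and this is not how Lucene or KS actually
stores TermInfo; it just illustrates the columnar trade-off.

```java
// Invented sketch of the "horizontal" TermInfo layout: one long[]
// column per kind of file position, instead of one TermInfo object
// per term.
import java.util.LinkedHashMap;
import java.util.Map;

public class TermInfoColumns {
    private final int numTerms;
    private final Map<String, long[]> columns = new LinkedHashMap<>();

    public TermInfoColumns(int numTerms) {
        this.numTerms = numTerms;
    }

    // An unrelated writer plugin registers its own column and fills in
    // its file positions: 8 bytes per term, no per-term hash entries.
    public long[] addColumn(String name) {
        long[] col = new long[numTerms];
        columns.put(name, col);
        return col;
    }

    public long get(String name, int termId) {
        return columns.get(name)[termId];
    }

    public static void main(String[] args) {
        TermInfoColumns ti = new TermInfoColumns(3);
        long[] postPos = ti.addColumn("postFilePos");
        long[] skipPos = ti.addColumn("skipFilePos");
        postPos[1] = 128;  // term 1's offset into the postings file
        skipPos[1] = 4096; // term 1's offset into the .skip file
        System.out.println(ti.get("postFilePos", 1)); // prints 128
        System.out.println(ti.get("skipFilePos", 1)); // prints 4096
    }
}
```

Memory stays flat as plugins add bookmarks, since each new bookmark is
a column, not a field on millions of per-term objects.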
> It would be nice to be able to plug modules into IndexReader that
> represent clusters of machines, but that are dedicated to specific
> tasks: one cluster could be dedicated to fetching full documents and
> applying highlighting; another cluster could be dedicated to scanning
> through postings and finding/scoring hits; a third cluster could store
> the entire term dictionary in RAM.
>
> A centralized term dictionary held in RAM would be particularly handy
> for sorting purposes. The problem is that the file pointers of a term
> dictionary are specific to indexes on individual machines. A shared
> dictionary in RAM would have to contain pointers for *all* clients,
> which isn't really workable.
>
> So, just how do you go about assembling task-specific clusters? The
> stored-documents cluster is easy, but the term dictionary and the
> postings are hard.

Phew! This is way beyond what I'm trying to solve now :)

>> For example, we should think about the Field APIs. Since we don't
>> have global field semantics in Lucene, I wonder how to handle
>> conflict cases, e.g. when a document specifies a different posting
>> list format than a previous one for the same field. The easiest way
>> would be to not allow it and throw an exception. But this is kind of
>> against Lucene's way of dealing with fields currently. And I'm scared
>> of the complicated code to handle conflicts of all the possible
>> combinations of posting list formats.
>
> Yeah. Lucene's field definition conflict-resolution code is gnarly
> already. :(

Yes. But if each plugin handles its own merging, I think this is
manageable.

>> KinoSearch doesn't have to worry about this, because it has a static
>> schema (I think?), but isn't as flexible as Lucene.
>
> Earlier versions of KS did not allow the addition of new fields on the
> fly, but this has been changed. You can now add fields to an existing
> Schema object like so:
>
>     for my $doc (@docs) {
>         # Dynamically define any new fields as 'text'.
>         for my $field ( keys %$doc ) {
>             $schema->add_field( $field => 'text' );
>         }
>         $invindexer->add_doc($doc);
>     }
>
> See the attached sample app for that snippet in context.
>
> Here are some current differences between KS and Lucene:
>
>   * KS doesn't yet purge *old* dynamic field definitions which have
>     become obsolete. However, that should be possible to add later,
>     as a sweep triggered during full optimization.
>   * You can't change the definition of an existing field.
>   * Documents are hash-based, so you can't have multiple fields with
>     the same name within one document object. However, I consider
>     that capability a misfeature of Lucene.
>
> In summary, I don't think that global field semantics meaningfully
> restrict flexibility for the vast majority of users.
>
> The primary distinction is/was philosophical. IIRC, Doug didn't want
> to force people to think about index design in advance, so the
> Field/Document API was optimized for newbies. In contrast, KS wants
> you to give it a Schema before indexing commences.
>
> It's still true that full-power KS forces you to think about index
> design up-front. However, there's now a KinoSearch::Simple API
> targeted at newbies which hides the Schema API and handles field
> definition automatically -- so Doug's ease-of-use design goal has been
> achieved via different means.

A fixed schema certainly simplifies many things, but I don't think this
is something we can change about Lucene at this point.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org