From java-dev-return-25260-apmail-lucene-java-dev-archive=lucene.apache.org@lucene.apache.org Fri Apr 11 00:11:01 2008 Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 23741 invoked from network); 11 Apr 2008 00:11:01 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 11 Apr 2008 00:11:01 -0000 Received: (qmail 57512 invoked by uid 500); 11 Apr 2008 00:10:55 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 57472 invoked by uid 500); 11 Apr 2008 00:10:55 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 57459 invoked by uid 99); 11 Apr 2008 00:10:55 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 10 Apr 2008 17:10:55 -0700 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy) Received: from [68.116.38.223] (HELO rectangular.com) (68.116.38.223) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 11 Apr 2008 00:09:58 +0000 Received: from c-76-115-21-31.hsd1.or.comcast.net ([76.115.21.31] helo=[192.168.11.50]) by rectangular.com with esmtpa (Exim 4.44) id 1Jk6t5-0004NV-1J for java-dev@lucene.apache.org; Thu, 10 Apr 2008 17:13:03 -0700 Message-Id: From: Marvin Humphrey To: java-dev@lucene.apache.org In-Reply-To: <9ac0c6aa0804100237l2672212dlcd2f12dbfe83f457@mail.gmail.com> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (Apple Message framework v919.2) Subject: Re: Pooling of posting objects in DocumentsWriter Date: Thu, 10 Apr 2008 17:10:13 -0700 References: <47FB88D2.2050505@gmail.com> <24883157-DFCF-4085-B239-4302360FA325@mikemccandless.com> <7F4D259D-3778-4FC3-A793-799AB17F0260@rectangular.com> <9ac0c6aa0804100237l2672212dlcd2f12dbfe83f457@mail.gmail.com> X-Mailer: Apple Mail (2.919.2) X-Virus-Checked: Checked by ClamAV on apache.org On Apr 10, 2008, at 2:37 AM, Michael McCandless wrote: >> IMO, the abstract base Posting class should not track text. It >> should >> include only one datum: a document number. This keeps it in line >> with the >> simplest IR definition for a "posting": one document matching one >> term. > > But how do you then write out a segment with the terms packed, in > sorted order? Your "generic" layer needs to know how to sort these > Posting lists by term text, right? Yes. But the generic layer is the semi-serialized export form RawPosting, not the abstract base class Posting. (Would it be clearer if RawPosting was named "FlatPosting"?) struct kino_Posting { kino_VirtualTable* _; kino_ref_t ref; chy_u32_t doc_num; }; struct kino_RawPosting { kino_VirtualTable* _; kino_ref_t ref; chy_u32_t doc_num; chy_u32_t freq; chy_u32_t content_len; chy_u32_t aux_len; char blob[1]; /* flexible array */ }; The last member of RawPosting, "blob", uses the C "flexible array" hack which takes advantage of C's lack of bounds checking. A RawPosting consists of a single contiguous memory allocation, but the size of that allocation varies depending on how much information the object needs to carry. The "blob" is divided into two parts: content, and "aux". Content is term text. "aux" is encoded material which will be written to the postings file. The total allocation for any given RawPosting can be calculated like so: offsetof(RawPosting, blob) + posting->content_len + posting->aux_len Sorting is done by comparing first the "content" component of blob using memcmp(), then breaking ties with content_len and finally, doc_num. The write method invoked by PostingsWriter is the same -- RawPost_Write_Record -- no matter what Posting subclass is assigned to the field. The first part of the implementing function RawPost_write_record may look familiar to you, as it uses a close variant of Lucene's algo for encoding doc_num and freq together. Note the last line, though: void RawPost_write_record(RawPosting *self, OutStream *outstream, u32_t last_doc_num) { const u32_t delta_doc = self->doc_num - last_doc_num; char *const aux_content = self->blob + self->content_len; /* Write doc num, freq. */ if (self->freq == 1) { const u32_t doc_code = (delta_doc << 1) | 1; OutStream_Write_C32(outstream, doc_code); } else { const u32_t doc_code = delta_doc << 1; OutStream_Write_C32(outstream, doc_code); OutStream_Write_C32(outstream, self->freq); } /* Print prepared data. */ OutStream_Write_Bytes(outstream, aux_content, self->aux_len); } The idea is that information other than doc num and freq gets serialized during the export operation which creates the RawPosting -- then the last line of RawPost_write_record prints the pre-serialized data to the OutStream. > OK I now see that what we call Posting really should be called > PostingList: each instance of this class, in DW, tracks all documents > that contained that term. I suppose that would be more accurate... though in KS, the "PostingList" class is the replacement for TermDocs, TermPositions, etc. > Whereas for KS, Posting is a single occurrence of term in a single > doc, right? It's one term matching one document. These are the definitions I'm using, as set out in KinoSearch::Docs::IRTheory at : * document - An atomic unit of retrieval. * term - An attribute which describes a document. * posting - One term indexing one document. * term list - The complete list of terms which describe a document. * posting list - The complete list of documents which a term indexes. * inverted index - A data structure which maps from terms to documents. > Does a Posting contain all > occurrences of the term in the doc (multiple positions) or only one? In KS, a Posting object can represent multiple occurrences of a term within a single document. That was somewhat arbitrary on my part. > How do you do buffering/flushing? RawPostings are fed into an object which handles external sorting. The external sorter tracks RAM usage, and checks once per document whether it should flush a sorted run to temporary storage. After the last document gets added, the external sorter flips to fetching mode. The pre-sorted runs are merged into a sorted stream of RawPostings, and the postings and lexicon files get written out. >> Then, for search-time you have a PostingList class which takes the >> place of >> TermDocs/TermPositions, and uses an underlying Posting object to >> read the >> file. (PostingList and its subclasses don't know anything about file >> formats.) > > Wouldn't PostingList need to know something of the file format? EG > maybe it's a sparse format (docID or gap encoded each time), or, it's > non-sparse (like norms, column-stride fields). Each subclass of Posting implements its own Read_Record() method. PostingList just hands off the InStream to its internal Posting object -- it doesn't know or care how much data from the InStream the Posting consumes with each Read_Record(). >> As I have argued before, the key is to have each Posting subclass >> wholly >> define a file format. That makes them pluggable, breaking the >> tight binding >> between the Lucene codebase and the Lucene file format spec. > > It's not just Posting that defines the file format. I could have been clearer -- I meant a postings file format. > I think TermVectorsWriter should be seen as a consumer of the > "inversion" plugin API. It's just that, unlike the frq/prx writer, > which flushes when RAM is full, the TermVectorsWriter flushes after > each doc. Ie, the generic code does the inversion, feeding "you" > Posting occurrences, and "you" write this to a file however you want. Yes, that's how it works in KS, too. :) I haven't written a formal Inversion class yet, though. Upon reflection, perhaps it would be better to have "Inversion" be specific to a single field, and "InvertedDocument" collect everything together? public void addDocument(Document document) { InvertedDocument inverteDoc = invert(document) for (i = 0; i < writers.length; i++) { writers[i].addInvertedDocument(invertedDoc); } } I think that's better in KS, because I can rename "TokenBatch" -- the class that actually does the low-level inverting and that subclasses of Analyzer need to deal with -- to "Inversion". Marvin Humphrey Rectangular Research http://www.rectangular.com/ --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org