Subject: Re: Baby steps towards making Lucene's scoring more flexible...
Date: Tue, 2 Mar 2010 05:55:44 -0500
From: Michael McCandless
To: java-dev@lucene.apache.org

On Sun, Feb 28, 2010 at 1:38 PM, Marvin Humphrey wrote:

> On Fri, Feb 26, 2010 at 12:50:44PM -0500, Michael McCandless wrote:
>
>> * Store additional per-doc stats in the index, eg in a custom
>>   posting list,
>
> Inline, as in a payload?  Of course that can work, but if the data
> is common over multiple postings, you pay in space to gain locality.
> KinoSearch does this right now -- inlining the norm byte within each
> posting -- but I think it was a dubious tradeoff.

No, not inlined -- stored once, separately, and iterated directly from
there.  I think the stats will be read once, at searcher init time,
and turned into in-RAM arrays (like norms).

> But why do we have to make a choice one way or the other?  So long
> as the decoder has the option of reading from arbitrary streams and
> data structures, we can leave it up to the posting format spec
> whether to inline or not.

Well, for starters I think we should hardwire the stats (computing
only field length & avg term freq).  Later we can make what's stored,
and how it's stored, pluggable.  Baby steps...

>> including length in tokens of the field, avg tf, and boost (boost
>> can be efficiently stored so it's only stored if it differs from
>> the default).
>
> Efficient compression always depends on identifying patterns within
> data.  Classical examples:
>
>  * Audio signals delta encode well because they are continuously
>    variable and adjacent samples are usually close together in
>    magnitude.
>
>  * Line art has a great affinity for run-length encoding because
>    many contiguous pixels have exactly the same color value.
>
> Since boost is optional and a lot of people don't use it, it's true
> that if we break it out away from length normalization it can be
> stored efficiently using something like RLE.
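A toy sketch of that run-length idea for per-doc boosts (a hypothetical encoding, just to illustrate -- not an actual Lucene or KinoSearch format): since most docs keep the default boost of 1.0, long runs of the default collapse to a single (length, value) pair.

```java
import java.util.ArrayList;
import java.util.List;

public class BoostRLE {
    /** Encode per-doc boosts as (runLength, value) pairs. */
    public static List<float[]> encode(float[] boosts) {
        List<float[]> runs = new ArrayList<>();
        int i = 0;
        while (i < boosts.length) {
            int j = i;
            while (j < boosts.length && boosts[j] == boosts[i]) j++;
            runs.add(new float[]{j - i, boosts[i]});  // {run length, boost value}
            i = j;
        }
        return runs;
    }

    /** Expand the runs back into one boost per doc. */
    public static float[] decode(List<float[]> runs, int numDocs) {
        float[] boosts = new float[numDocs];
        int doc = 0;
        for (float[] run : runs) {
            for (int k = 0; k < (int) run[0]; k++) boosts[doc++] = run[1];
        }
        return boosts;
    }

    public static void main(String[] args) {
        // 6 docs, only doc 3 carries a non-default boost:
        float[] boosts = {1f, 1f, 1f, 2.5f, 1f, 1f};
        List<float[]> runs = encode(boosts);
        System.out.println(runs.size());  // 3 runs: [3 x 1.0][1 x 2.5][2 x 1.0]
    }
}
```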
> However, boost isn't the only thing that may be worth compressing:
> single-token and match-only fields have constant length-in-tokens.

Good point.  For starters I think we'll use some simple fixed
encoding.  Users can turn the stats on/off if they don't want to pay
the disk space.

> Surely the posting format specification must be allowed to exploit
> any compression technique, at its own discretion and with insight
> into data patterns unavailable to us from the top level.

Eventually, yes (when we make this pluggable).

>> Do not compute nor store norms in the index.
>
> Heh.  Well, with Lucy, naturally we'd want to do the opposite of
> that: write everything at index time and then mmap at search time.
> Even in Lucene, it seems odd to want to calculate all of those on
> the fly each time you open an index.  It seems to me that this is a
> specialized need of BM25.

The problem is, these scoring models need the avg field length (in
tokens) across the entire index to compute the norms.  Ie, you can't
do that while writing a single segment.  So I think it must be done
during searcher init.  The most we can do is store the aggregates (eg
the sum of all field lengths in this segment) in the SegmentInfo --
this saves one pass on searcher init.

> So how do you support both?  Easy: allow the postings reader an
> initial startup phase where it can cache arbitrary data, typically
> data which is shared across multiple postings.  For instance:
>
>  * Either slurp (Lucene) or mmap (Lucy) a pregenerated norms file.
>  * Accumulate/derive data, e.g. true normalization data for BM25.
>  * Do nothing, because the field is match-only.
>
> You'd want to do this inside some sort of per-field class, which
> would then spawn individual enumerators.  The shared reader holds
> common data; the enumerators reference the common data, but maintain
> state while iterating over postings.
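That shared-reader / per-enumerator split could look roughly like this (all class names here are illustrative stand-ins, not actual Lucene or Lucy APIs): one persistent reader per field holds the common data, and each enumerator it spawns keeps only its own iteration state.

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldDispatch {
    /** Holds data shared across all enumerators for one field. */
    public static class PostingsReader {
        public final String field;
        public PostingsReader(String field) { this.field = field; }
        /** Spawn an enumerator that references this shared reader. */
        public DocsEnum docsEnum() { return new DocsEnum(this); }
    }

    public static class DocsEnum {
        public final PostingsReader shared;  // common, persistent data
        int doc = -1;                        // per-enumerator iteration state
        DocsEnum(PostingsReader shared) { this.shared = shared; }
    }

    /** Dispatch: one shared PostingsReader per field, created lazily. */
    public static class PostingListReader {
        private final Map<String, PostingsReader> perField = new HashMap<>();
        public PostingsReader forField(String field) {
            return perField.computeIfAbsent(field, PostingsReader::new);
        }
    }

    public static void main(String[] args) {
        PostingListReader plr = new PostingListReader();
        DocsEnum a = plr.forField("body").docsEnum();
        DocsEnum b = plr.forField("body").docsEnum();
        // Two enumerators for the same field share one reader instance:
        System.out.println(a.shared == b.shared);  // true
    }
}
```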
>   SegmentReader
>     PostingListReader
>       map of field->PostingsReader pairs  <-- per-field shared data, persistent
>
> Requests for various enumerators (doc id only, doc id + positions,
> etc.) would be dispatched through to the PostingsReader for the
> specified field.

This is roughly how flex does it today for postings (though you gain
access to the postings reader via the fields & terms enums).

But stats are different?  They are like column-stride fields -- ie a
posting list that is usually dense (every doc is included), and is
iterated at the field level (not field + term).  Or, hmmm... maybe
"term" will be the name of the statistic, and maybe we use Attribute
to keep it "generic".  Maybe...

>> Merging would just concatenate these values (removing deleted
>> docs).
>
> The ability to concatenate depends on having either no externally
> referenced data, or having all externally referenced data be
> immutable.
>
> Theoretically, concatenation works fine for delta-encoded position
> data because positions start over at 0 with each document.  They're
> wholly self-contained.
>
> As a counter-example, consider that we can concatenate serialized
> documents so long as the associations between field names and
> numbers stay constant, but we must decode and re-encode when those
> associations change.
>
> The merger cannot know in advance whether concatenation is
> appropriate because the codec is opaque to it.  I think the default
> should be to merge fully expanded objects, with the option to
> concatenate as an optimization left up to the codec.  Otherwise, we
> end up restricting the kind of external references that can be made
> by a posting.

Sorry -- "concatenate" was an overloaded term -- I didn't mean we
would literally physically concatenate bytes (thus requiring a posting
format that allows for that).  I meant that, logically, this is what
merging does.  Ie we preserve all the per-doc stats for all
non-deleted docs when we merge.
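Ie, something like this logical merge of a per-doc stat (a toy sketch of the idea; in a real merge this would go through the codec rather than raw arrays): each segment's values are appended in order, with deleted docs' entries dropped.

```java
public class StatsMerge {
    /**
     * "Logical concatenation": append each segment's per-doc field
     * lengths in segment order, skipping deleted docs.
     * deleted[s][d] == true means doc d of segment s is deleted.
     */
    public static int[] mergeFieldLengths(int[][] segLengths, boolean[][] deleted) {
        int live = 0;
        for (int s = 0; s < segLengths.length; s++)
            for (int d = 0; d < segLengths[s].length; d++)
                if (!deleted[s][d]) live++;

        int[] merged = new int[live];
        int out = 0;
        for (int s = 0; s < segLengths.length; s++)
            for (int d = 0; d < segLengths[s].length; d++)
                if (!deleted[s][d]) merged[out++] = segLengths[s][d];
        return merged;
    }

    public static void main(String[] args) {
        int[][] lengths = {{3, 7, 5}, {2, 9}};
        boolean[][] del = {{false, true, false}, {false, false}};
        // Doc 1 of segment 0 is deleted; its stat is dropped, the rest
        // keep their relative order:
        System.out.println(
            java.util.Arrays.toString(mergeFieldLengths(lengths, del)));
        // [3, 5, 2, 9]
    }
}
```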
For the actual impl, I agree the default is to fully decode &
re-encode.  This is how flex handles merging today (but the codec can
override it, at each level of fields/terms/postings, if it wants).

Lucene's default postings format could nicely support raw byte
concatenation, since the docIDs are delta coded.  You'd just have to
rewrite the one posting at each segment boundary.  But I don't think
we've explored that one yet... it could be a good merging speedup,
though only for the no-deletes (or maybe few-deletes?) case.

>> * Change IR so on open it generates norms dynamically, ie by
>>   walking the stats, computing avgs (eg avg field length in
>>   tokens), and computing the final per-field boost, casting to a
>>   1-byte quantized float.
>
> Right, I think this data should be cached by a per-field parent
> reader object during an initialization phase.

Yes.

>> We may want to store aggregates in eg SegmentInfo to save the
>> extra pass on IR open...
>
> Hmm, that sounds familiar...
>
>   https://svn.apache.org/repos/asf/lucene/lucy/trunk/core/Lucy/Index/Segment.bp
>
>   Each Segment object keeps track of information about an index
>   segment: its fields, document count, and so on.  The Segment
>   object itself writes one file, segmeta.json; besides storing info
>   needed by Segment itself, the "segmeta" file serves as a central
>   repository for metadata generated by other index components --
>   relieving them of the burden of storing metadata themselves.

Right, this is Lucy's correlate of Lucene's SegmentInfos/segments_N
file.

> As far as aggregates go, I think you want to be careful to avoid
> storing any kind of data that scales with segment size within a
> SegmentInfo.

Right.  By aggregates I mean eg the total sum of all field lengths (a
single long) for the segment.

>> * Change Similarity to allow a field-specific Similarity (I think
>>   we have an issue open for this already).  I think, also,
>>   lengthNorm (which is no longer invoked during indexing) would no
>>   longer be used.
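A minimal sketch of what per-field Similarity state could look like (hypothetical names; the real API is exactly what's under discussion here) -- each field gets its own Similarity instance, initialized once, holding per-field state such as a norms array:

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldSimilarity {
    /** Stand-in for a Similarity carrying per-field state (eg a norms array). */
    public static class Similarity {
        final byte[] norms;  // per-doc norms, computed once at init for this field
        public Similarity(byte[] norms) { this.norms = norms; }
        public float norm(int doc) { return norms[doc] / 255f; }
    }

    /** Provider mapping field name -> Similarity, with a fallback default. */
    public static class SimilarityProvider {
        private final Map<String, Similarity> perField = new HashMap<>();
        private final Similarity defaultSim;
        public SimilarityProvider(Similarity def) { this.defaultSim = def; }
        public void set(String field, Similarity sim) { perField.put(field, sim); }
        public Similarity get(String field) {
            return perField.getOrDefault(field, defaultSim);
        }
    }

    public static void main(String[] args) {
        Similarity def = new Similarity(new byte[]{(byte) 127});
        SimilarityProvider provider = new SimilarityProvider(def);
        provider.set("title", new Similarity(new byte[]{(byte) 51}));
        // "title" uses its own sim and norms; unconfigured fields fall back:
        System.out.println(provider.get("body") == def);  // true
    }
}
```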
> Well, as you might suspect, I consider that one a gimme.  KinoSearch
> supports per-field Similarity now.

Great!  This is pretty much required, since each field now has
substantial init cost & state associated with it.  The norms array
will be stored in this per-field sim instance.

> The insane loose typing of fields in Lucene is going to make it a
> little tricky to implement, though.  I think you just have to
> exclude fields assigned to specific similarity implementations from
> your merge-anything-to-the-lowest-common-denominator policy and
> throw exceptions when there are conflicts rather than attempt to
> resolve them.

Our disposition on conflict (throw an exception vs silently coerce)
should just match what we do today, which is to always silently
coerce.  I think I'd coerce the way norms do -- "once enabled, always
enabled".  Throwing an exception is bad because we can't detect the
conflict until it's too late, at which point no merges of different
segments can run.  If you mark a field as 'no stats', index docs, then
later mark the field as 'stats', then on merge the stats will be set
to default values.

>> I think we'd make the class that computes norms from these per-doc
>> stats on IR open pluggable.
>
> Similarity is where we decode norms right now.  In my opinion, it
> should be the Similarity object from which we specify per-field
> posting formats.

I agree.

> See my reply to Robert in the BM25 thread:
>
>   http://markmail.org/message/77rmrfmpatxd3p2e
>
> That way, custom scoring implementations can guarantee that they
> always have the posting information they need available to make
> their similarity judgments.  Similarity also becomes a more
> generalized notion, with the TF/IDF-specific functionality moving
> into a subclass.
>
> Similarity implementation and posting format are so closely related
> that in my opinion, it makes sense to tie them.
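For concreteness, here is the BM25 tf/length factor driving all of this (the standard BM25 formula as toy code, not any proposed API): it needs the per-doc field length plus the index-wide average field length, which is exactly the stat being debated above.

```java
public class BM25Norm {
    /**
     * BM25's tf component with length normalization:
     *   tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
     * Note the dependence on avgDocLen, an index-wide aggregate that
     * cannot be finalized while writing any single segment.
     */
    public static float tfScore(float tf, float docLen, float avgDocLen,
                                float k1, float b) {
        float norm = k1 * (1 - b + b * docLen / avgDocLen);
        return tf * (k1 + 1) / (tf + norm);
    }

    public static void main(String[] args) {
        float k1 = 1.2f, b = 0.75f;
        // Same tf; one doc at the average length, one at twice the average:
        float avgLenScore = tfScore(3, 10, 10, k1, b);
        float longScore = tfScore(3, 20, 10, k1, b);
        // The longer doc is penalized for the same term frequency:
        System.out.println(avgLenScore > longScore);  // true
    }
}
```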
This confuses me -- shouldn't what is stored in these stats (each
field's token length, each field's avg tf, whatever else a codec wants
to add over time...) be decoupled from the low-level format used to
store it?  A BM25 similarity impl plugged into Lucene shouldn't know
or care how the codec wrote the stats to disk.  It only cares that it
can ask for the full enum (field token length for all docs).

>> And, someday we could make what stats are gathered/stored during
>> indexing pluggable, but for starters I think we should simply
>> support the field length in tokens and avg tf per field.
>
> I would argue against making this your top priority, because I think
> adding half-baked features that require index-format changes is bad
> policy.

I'm also against doing this ("making what stats are stored, and how,
fully pluggable") for the dirt path.  I plan to start with "only" the
2 stats (per-doc field token length and per-doc avg tf).  I'm going to
pick an index format that encodes these stats.  It won't be pluggable
on the first go.  Start simple and then iterate...

> If you're looking for small steps, my suggestion would be to focus
> on per-field Similarity support.

Well, that alone isn't sufficient -- the index needs to record/provide
the raw stats, and doc boosting (the norms array) needs to be done
using these stats.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org