From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Flexible indexing design
Date Fri, 18 Apr 2008 21:47:25 GMT

On Apr 17, 2008, at 11:57 AM, Michael McCandless wrote:

> If I have a pluggable indexer,
> then on the querying side I need something (I'm not sure what/how)
> that knows how to create the right demuxer (container) and codec
> (decoder) to interact with whatever my indexing plugins wrote.
>
> So I don't think it's "out of bounds" to have *Query classes that know
> to use the non-prx variant of a field when positions aren't needed.
> Though I agree it makes things more complex.

Perhaps switching up files based on Query type isn't "out of bounds",  
but you get hit with a double whammy. First, there's the added  
complexity -- which gets extra nasty when you consider that we  
wouldn't want this optimization to be on by default.  Second, you wind  
up with duplicated information in the index -- resulting in greater  
space requirements and increased indexing time.

That's an awful lot of inconvenience for the sake of an optimization.  
I wouldn't bother.

My initial thought was that the unified postings file format that KS  
is using now offers so many benefits that KS should go with it anyway,  
despite the increased costs for simple TermQueries.   But maybe we can  
nail the design for the divide-by-data-type format right now.

>>> This is like "column stride" vs "row stride" serialization
>>> of a matrix.
>>>
>>> Relatively soon, though, we will all be on SSDs, so maybe this
>>> locality argument becomes far less important ;)
>>>
>>
>> Yes, I've thought about that.  It defeats the phrase-query locality
>> argument for unified postings files and recommends breaking things up
>> logically by type of data into frq/prx/payload/whatever.
>>
>> Would it be possible to design a Posting plugin class that reads from
>> multiple files?  I'm pretty sure the answer is yes.  It messes up the
>> single-stream readRecord() method I've been detailing and might force
>> Posting to maintain state.  But if Postings are scarce TermBuffer-style
>> objects where values mutate, rather than populous Term-style objects
>> where you need a new instance for each set of values, then it doesn't
>> matter if they're large.
>>
>> If that could be done, I think it would be possible to retrofit the
>> Posting/PostingList concept into Lucene without a file format change.
>> FWIW.
>
> I think this is possible.

We should aggressively explore this concept.  By my tally, breaking up  
posting content into multiple files by data type has the most benefits  
and the fewest drawbacks.

Brainstorming...

In the KS devel branch, ScorePost_Read_Record reads three blocks of data
from an InStream:

   doc_num/freq
   boost/norm byte
   positions

Let's say that instead of ScorePost_Read_Record operating on a passed- 
in InStream, we give ScorePosting a ScorePost_Next() method, and have  
it operate on three internally maintained InStreams:

     void
     ScorePost_Next(ScorePosting *self)
     {
         u32_t  num_prox;
         u32_t  position = 0;
         u32_t *positions;
         const u32_t doc_code  = InStream_Read_C32(self->frq_in);
         const u32_t doc_delta = doc_code >> 1;

         /* Apply delta doc and retrieve freq. */
         self->doc_num += doc_delta;
         if (doc_code & 1)
             self->freq = 1;
         else
             self->freq = InStream_Read_C32(self->frq_in);

         /* Decode boost/norm byte. */
         self->impact = self->sim->norm_decoder[
                 InStream_Read_U8(self->norms_in)
             ];

         /* Read positions, growing the buffer as needed. */
         num_prox = self->freq;
         if (num_prox > self->prox_cap) {
             self->prox     = REALLOCATE(self->prox, num_prox, u32_t);
             self->prox_cap = num_prox;
         }
         positions = self->prox;
         while (num_prox--) {
             position += InStream_Read_C32(self->prx_in);
             *positions++ = position;
         }
     }

So, there we have a single operation, in a single class, defining a  
codec -- achieving my main goal, even while reading from multiple files.

But check this out: we could feed the ScorePosting constructor the  
same InStream object three times, and it wouldn't know the  
difference.  So the same read operation can be used against both a  
multi-file format and a single file format.
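
Something like this hypothetical constructor would do (a sketch only --
the signature, the field names, and the MALLOCATE counterpart to
REALLOCATE are all assumptions, not actual KS code):

     ScorePosting*
     ScorePost_new(Similarity *sim, InStream *frq_in,
                   InStream *norms_in, InStream *prx_in)
     {
         /* Whether the three streams are distinct files or the same
          * object passed three times is invisible to the Posting. */
         ScorePosting *self = MALLOCATE(1, ScorePosting);
         self->sim      = sim;
         self->frq_in   = frq_in;
         self->norms_in = norms_in;
         self->prx_in   = prx_in;
         self->doc_num  = 0;
         self->freq     = 0;
         self->prox     = NULL;
         self->prox_cap = 0;
         return self;
     }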

Seeking might get a little weird, I suppose.

I think a variant of that read op could be created against various  
versions of the Lucene file format going back, making it possible to  
isolate and archive obsolete codecs and clean up the container classes.

> When reading an index, the
> Posting/PostingList should be more like TermBuffer than Term.

Yes.

Now's a good time to remark on a difference between KS and Lucene when  
reading the equivalent of TermPositions: In KS, all positions are read  
in one grab during ScorePost_Read_Record -- there's no nextPosition()  
method.  That was a tradeoff I made primarily for simplicity's sake,  
since it meant that PhraseScorer could be implemented with integer  
arrays and pointer math.  Reduced CPU overhead was another theoretical  
benefit, but I've never benchmarked it.
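
To illustrate the pointer-math point (a toy sketch, not actual KS
PhraseScorer code): with all positions pre-decoded into arrays, a
two-term phrase check reduces to walking two sorted arrays in lockstep.

    /* Toy sketch, not KS source: does term B ever occur exactly one
     * position after term A within this doc?  Both arrays are sorted
     * ascending. */
    static int
    phrase_match(const u32_t *a, u32_t a_len, const u32_t *b, u32_t b_len)
    {
        const u32_t *const a_end = a + a_len;
        const u32_t *const b_end = b + b_len;
        while (a < a_end && b < b_end) {
            if      (*b == *a + 1) { return 1; }
            else if (*b <  *a + 1) { b++; }
            else                   { a++; }
        }
        return 0;
    }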

If you didn't want to do that but you still wanted to implement a  
PhraseScorer based around PostingList objects rather than  
TermPositions objects, you have a bit of a quandary: PostingList only  
advances in doc-sized increments, because it doesn't have a  
nextPosition() method.  So, nextPosition() would have to be  
implemented by ScorePosting:

   // Iterate over docs and positions.
   while (postingList.next()) {
     ScorePosting posting = postingList.getPosting();
     while (posting.nextPosition()) {
       int position = posting.getPosition();
       ...
     }
   }

If we did that, it would require a change in the behavior of
ScorePost_Next(), above: it would leave the freq position deltas unread
in prx_in, and a new method would decode them one at a time.
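
A sketch of that method, assuming hypothetical pos and prox_left fields
that ScorePost_Next() would reset for each doc:

    /* Sketch: lazy position decoding.  Assumes ScorePost_Next() now
     * sets self->prox_left = self->freq and self->pos = 0 per doc
     * rather than reading the position block eagerly.  Field names
     * are hypothetical. */
    static int
    ScorePost_Next_Position(ScorePosting *self)
    {
        if (self->prox_left == 0)
            return 0;                /* positions exhausted for this doc */
        self->pos += InStream_Read_C32(self->prx_in);
        self->prox_left--;
        return 1;
    }

One wrinkle: ScorePost_Next() would then have to drain any deltas the
caller left unread before it could decode the next doc.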

> Thinking about a codec reading multiple files... I would also love
> this: a codec that can write & read layered updates to the inverted
> files.

IMO, it should not be the codec object that performs actions like this  
-- it should be a container class that the user knows nothing about.

> I agree, Lucene needs a stronger separation of "container" from
> "codec".  If someone just wants to plugin a new codec they should be
> able to cleanly plug into an existing container and "only" provide the
> codec.
>
> EG, say I want to store say an arbitrary array of ints per document.
> Not all documents have an array, and when they do, the array length
> varies.
>
> To do this, I'd like to re-use a container that knows how to store a
> byte blob per document, sparsely, and say indexed with a multi-level
> skip list.  That container is exactly the container we now use for
> storing frq/prx data, under each term.  So, ideally, I can easily
> re-use that container and just provide a codec (encoder & decoder)
> that maps from an int[] to a byte blob, and back.
>
> We need to factor things so that this container can be easily shared
> and is entirely decoupled from the codecs it's using.

Perfect.
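
To sketch what such a codec might look like (illustrative only -- the
IntArray type and function names are inventions, and I'm assuming an
OutStream_Write_C32 counterpart to InStream_Read_C32): the codec
supplies just the two mappings, while the container owns sparseness,
skip data, and file layout.

    /* Sketch, not an existing API: an int array is serialized as a
     * count followed by C32-compressed values. */
    typedef struct {
        u32_t  len;
        u32_t *ints;
    } IntArray;

    static void
    IntArrayCodec_Encode(const IntArray *array, OutStream *blob_out)
    {
        u32_t i;
        OutStream_Write_C32(blob_out, array->len);
        for (i = 0; i < array->len; i++)
            OutStream_Write_C32(blob_out, array->ints[i]);
    }

    static void
    IntArrayCodec_Decode(InStream *blob_in, IntArray *array)
    {
        u32_t i;
        array->len  = InStream_Read_C32(blob_in);
        array->ints = REALLOCATE(array->ints, array->len, u32_t);
        for (i = 0; i < array->len; i++)
            array->ints[i] = InStream_Read_C32(blob_in);
    }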

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



