lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: improve how IndexWriter uses RAM to buffer added documents
Date Thu, 05 Apr 2007 20:49:56 GMT

On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:

>>> (I think for KS you "add" a previous segment not that
>>> differently from how you "add" a document)?
>>
>> Yeah.  KS has to decompress and serialize posting content, which sux.
>>
>> The one saving grace is that with the Fibonacci merge schedule and
>> the seg-at-a-time indexing strategy, segments don't get merged nearly
>> as often as they do in Lucene.
>
> Yeah we need to work on this one.

What we need to do is cut down on decompression and conflict  
resolution costs when reading from one segment to another.  KS has  
solved this problem for stored fields.  Field defs are global and  
field values are keyed by name rather than field number in the field  
data file.  Benefits:

   * Whole documents can be read from one segment to
     another as blobs.
   * No flags byte.
   * No remapping of field numbers.
   * No conflict resolution at all.
   * Compressed, uncompressed... doesn't matter.
   * Less code.
   * The possibility of allowing the user to provide their
     own subclass for reading and writing fields. (For
     Lucy, in the language of your choice.)
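
To make the "documents as blobs" point concrete, here is a minimal sketch in Python. It is not the KS file format: the JSON encoding, the 4-byte length prefix, and the function names (encode_doc, write_segment, merge_segments, read_segment) are all assumptions made up for illustration. The only idea taken from the description above is that each stored document carries its field names with it, so a merge never has to decode a document or remap field numbers.

```python
import json

def encode_doc(fields):
    """Serialize one document as a length-prefixed blob keyed by field name.

    Hypothetical layout: 4-byte big-endian length, then a JSON object.
    """
    body = json.dumps(fields, sort_keys=True).encode("utf-8")
    return len(body).to_bytes(4, "big") + body

def write_segment(docs):
    """Build the stored-fields data for one segment."""
    return b"".join(encode_doc(doc) for doc in docs)

def merge_segments(segments):
    """Merge stored fields by copying bytes.

    Because every blob is self-describing (keyed by name), there is no
    per-segment field-number table to reconcile and nothing to decode.
    """
    return b"".join(segments)

def read_segment(data):
    """Decode every document in a segment, e.g. at search time."""
    docs = []
    pos = 0
    while pos < len(data):
        size = int.from_bytes(data[pos:pos + 4], "big")
        pos += 4
        docs.append(json.loads(data[pos:pos + size].decode("utf-8")))
        pos += size
    return docs

# Two segments whose documents use different sets of fields.
seg_a = write_segment([{"title": "a", "body": "x"}])
seg_b = write_segment([{"author": "m", "body": "y"}])
merged = merge_segments([seg_a, seg_b])
```

In this sketch the merge is a plain byte copy, and the two segments never need to agree on anything beyond the blob framing. The cost is that field names are repeated in every document, which a real format would presumably want to offset somehow.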

What I haven't got yet is a way to move terms and postings  
economically from one segment to another.  But I'm working on it.  :)

> One thing that irks me about the
> current Lucene merge policy (besides that it gets confused when you
> flush-by-RAM-usage) is that it's a "pay it forward" design so you're
> always over-paying when you build a given index size.  With KS's
> Fibonacci merge policy, you don't.  LUCENE-854 has some more details.

However, even under Fibo, when you get socked with a big merge, you  
really get socked.  It bothers me that the time for adding to your  
index can vary so unpredictably.

> Segment merging really is costly.  In building a large (86 GB, 10 MM
> docs) index, 65.6% of the time was spent merging!  Details are in
> LUCENE-856...

> This is a great model.  Are there Python bindings to Lucy yet/coming?

I'm sure that they will appear once the C core is ready.  The  
approach I am taking is to make some high-level design decisions  
collaboratively on lucy-dev, then implement them in KS.  There's a  
large amount of code that has been written according to our specs  
that is working in KS and ready to commit to Lucy after trivial  
changes.  There's more that's ready for review.  However, release of  
KS 0.20 is taking priority, so code flow into the Lucy repository has  
slowed.

I'll also be looking for a job in about a month.  That may slow us  
down some more, though it won't stop things --  I've basically  
decided that I'll do what it takes to get Lucy off the ground.  I'll go  
with something stopgap if nothing materializes which is compatible  
with that commitment.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


