lucene-dev mailing list archives

From "Marvin Humphrey (JIRA)" <>
Subject [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Date Mon, 30 Apr 2007 11:45:15 GMT


Marvin Humphrey commented on LUCENE-843:

How are you writing the frq data in compressed format?  The  works fine for
prx data, because the deltas are all within a single doc -- but for the freq
data, the deltas are tied up in doc num deltas, so you have to decompress it
when performing merges.  
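
A minimal sketch of the problem described above, assuming the frq layout from the Lucene file format documentation: each entry is a VInt holding the doc delta shifted left one bit, with the low bit set when freq == 1 and an extra freq VInt otherwise. All function names here are hypothetical, and the full decode/re-encode merge is only the simplest illustration, not a claim about how Lucene or KinoSearch merges.

```python
def write_vint(out, value):
    """Append a non-negative int as a VInt: 7 bits per byte, low bits first."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_vint(buf, pos):
    """Read one VInt starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not (b & 0x80):
            return value, pos
        shift += 7


def encode_frq(postings):
    """Encode (doc_num, freq) pairs, sorted by doc_num, as delta-coded frq bytes."""
    out = bytearray()
    last_doc = 0
    for doc, freq in postings:
        delta = doc - last_doc
        last_doc = doc
        if freq == 1:
            # Common case: freq is folded into the low bit of the doc delta.
            write_vint(out, (delta << 1) | 1)
        else:
            write_vint(out, delta << 1)
            write_vint(out, freq)
    return bytes(out)


def decode_frq(buf):
    """Decode frq bytes back into absolute (doc_num, freq) pairs."""
    postings = []
    pos = 0
    doc = 0
    while pos < len(buf):
        code, pos = read_vint(buf, pos)
        doc += code >> 1
        if code & 1:
            freq = 1
        else:
            freq, pos = read_vint(buf, pos)
        postings.append((doc, freq))
    return postings


def merge_frq(first, second, doc_base):
    """Merge two encoded runs; docs in `second` are renumbered by doc_base.

    The doc delta and the freq flag share one VInt, so the second run is
    decoded and re-encoded rather than appended byte-for-byte.
    """
    shifted = [(doc + doc_base, freq) for doc, freq in decode_frq(second)]
    return encode_frq(decode_frq(first) + shifted)
```

By contrast, position deltas in prx data restart within each document, so renumbering documents leaves those bytes unchanged.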

To continue our discussion from java-dev... 

 * I haven't been able to come up with a file format tweak that 
   gets around this doc-num-delta-decompression problem to enhance the speed
   of frq data merging. I toyed with splitting off the freq from the
   doc_delta, at the price of increasing the file size in the common case of
   freq == 1, but went back to the old design.  It's not worth the size
   increase for what's at best a minor indexing speedup.
 * I've added a custom MemoryPool class to KS which grabs memory in 1 meg
   chunks, allows resizing (downwards) of only the last allocation, and can
   only release everything at once.  From one of these pools, I'm allocating
   RawPosting objects, each of which is a doc_num, a freq, the term_text, and
   the pre-packed prx data (which varies based on which Posting subclass
   created the RawPosting object).  I haven't got things 100% stable yet, but
   preliminary results seem to indicate that this technique, which is a riff
   on your persistent arrays, improves indexing speed by about 15%.
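
The MemoryPool behaviour described in the second bullet can be sketched roughly as follows. This is not the KinoSearch code: the class layout, the method names (`grab`, `shrink_last`, `release_all`) and the reading of "1 meg" as 2**20 bytes are all assumptions made for illustration.

```python
CHUNK_SIZE = 1 << 20  # assumed meaning of "1 meg": 1 MiB per chunk


class MemoryPool:
    """Arena-style pool: bump allocation from fixed-size chunks."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.chunks = []   # list of bytearray chunks
        self.offset = 0    # next free byte in the newest chunk
        self.last = None   # (start, size) of the most recent allocation

    def grab(self, size):
        """Allocate `size` bytes, starting a new chunk when the current one is full."""
        if size > self.chunk_size:
            raise ValueError("allocation larger than one chunk")
        if not self.chunks or self.offset + size > self.chunk_size:
            self.chunks.append(bytearray(self.chunk_size))
            self.offset = 0
        start = self.offset
        self.offset += size
        self.last = (start, size)
        return memoryview(self.chunks[-1])[start:start + size]

    def shrink_last(self, new_size):
        """Resize the most recent allocation downwards, returning the tail to the pool."""
        if self.last is None:
            raise RuntimeError("no allocation to resize")
        start, size = self.last
        if new_size > size:
            raise ValueError("only downward resizing is supported")
        self.offset = start + new_size
        self.last = (start, new_size)
        return memoryview(self.chunks[-1])[start:start + new_size]

    def release_all(self):
        """Drop every chunk at once; individual allocations cannot be freed."""
        self.chunks.clear()
        self.offset = 0
        self.last = None
```

A caller that does not know the final size of a record up front could grab a worst-case amount, fill it in, and then call `shrink_last` with the size actually used before the next allocation.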

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>                 Key: LUCENE-843
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch,
>                      LUCENE-843.take4.patch, LUCENE-843.take5.patch
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, e.g. how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
