lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <>
Subject [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Date Sun, 24 Jun 2007 19:57:26 GMT


Doron Cohen commented on LUCENE-843:

Just to clarify your comment on reusing field and doc instances: as I understand it, reusing
a field instance is OK *only* after the containing doc has been added to the index.
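
To illustrate the point about reuse, here is a minimal sketch with toy classes. `Field`, `Document` and `Index` below are hypothetical stand-ins, not Lucene's classes, and the sketch assumes the index reads field values at the moment the doc is added.

```python
class Field:
    """Toy field holding a name and a mutable value (not Lucene's Field)."""
    def __init__(self, name, value):
        self.name = name
        self.value = value


class Document:
    """Toy document; it keeps references to fields, not copies."""
    def __init__(self):
        self.fields = []

    def add(self, field):
        self.fields.append(field)


class Index:
    """Toy index that snapshots field values when a doc is added (assumption)."""
    def __init__(self):
        self.indexed = []

    def add_document(self, doc):
        self.indexed.append({f.name: f.value for f in doc.fields})


index = Index()
word = Field("word", "")
doc = Document()
doc.add(word)

# Safe: each value is consumed by add_document before the field is changed again.
for w in ["lucene", "index"]:
    word.value = w
    index.add_document(doc)

# Unsafe: the field is changed while its doc has not been added yet,
# so the earlier value never reaches the index.
pending = Document()
pending.add(word)
word.value = "first"
word.value = "second"
index.add_document(pending)

print(index.indexed)
```

In this toy model the value "first" is lost because the shared field instance was overwritten before the doc that contained it was added.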

For a "fair" comparison I ended up not following most of your recommendations, including the
reuse field/docs one and the non-compound one (apologies:-)), but I might use them later.

For the first 100,000,000 docs (==speller words) the speed-up is quite amazing:
    Orig:    Speller: added 100000000 words in 58490 seconds = 16 hours 14 minutes 50 seconds
    New:     Speller: added 100000000 words in 10912 seconds = 3 hours 1 minutes 52 seconds
This is about 5.3 times faster!
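
The arithmetic behind these figures can be checked with a short sketch. The helper name `hms` and the variable names are mine, chosen for illustration; the two durations are the ones quoted above.

```python
def hms(seconds):
    """Split a whole number of seconds into (hours, minutes, seconds)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return hours, minutes, secs


# The two timings quoted above, each for 100,000,000 words.
slow_seconds = 58490
fast_seconds = 10912

# Ratio of the slower run to the faster run.
speedup = slow_seconds / fast_seconds

print(hms(slow_seconds))   # breakdown of the slower run
print(hms(fast_seconds))   # breakdown of the faster run
print(round(speedup, 2))   # speed-up factor
```

The ratio comes out at roughly 5.36, which matches the "5.3 times" figure when truncated to one decimal.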

This, btw, was with maxBufDocs=100,000 (I forgot to set the MEM param).
I have stopped the run now; I don't expect to learn anything new by letting it continue.

When trying with MEM=512MB, it at first seemed faster, but then there were occasional local
slow-downs, and eventually it became a bit slower than the previous run. I know these are
not merges, so they are either flushes (RAM directed) or GC activity. I will perhaps run
with GC debug flags and perhaps add a print at flush, so as to tell the culprit for these local slow-downs.

Other than that, I will perhaps try to index .GOV2 (25 Million HTML docs) with this patch.
The way I indexed it before, it took about 4 days, running in 4 threads and creating 36 indexes.
This is an even more real-life scenario: it involves HTML parsing, standard analysis, and merging
(to some extent). Since there are 4 threads, each one will get, say, 250MB. Again, for a "fair"
comparison, I will stay with compound.

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>                 Key: LUCENE-843
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch,
>                      LUCENE-843.take5.patch, LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
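
As an illustration of the difference between flushing by document count and flushing by RAM usage, here is a minimal toy sketch. `BufferedWriter` and its parameters are hypothetical and do not reflect the patch or IndexWriter's actual code; the "RAM estimate" is simply the UTF-8 byte length of each document.

```python
class BufferedWriter:
    """Toy document buffer (hypothetical; not Lucene's IndexWriter)."""

    def __init__(self, ram_budget_bytes=None, max_buffered_docs=None):
        self.ram_budget_bytes = ram_budget_bytes
        self.max_buffered_docs = max_buffered_docs
        self.buffer = []
        self.buffered_bytes = 0
        self.flushes = []  # number of docs written by each flush

    def add_document(self, text):
        self.buffer.append(text)
        # Crude RAM estimate: the encoded size of the document text.
        self.buffered_bytes += len(text.encode("utf-8"))
        if self._should_flush():
            self.flush()

    def _should_flush(self):
        # A RAM budget, when set, takes precedence over the doc-count limit.
        if self.ram_budget_bytes is not None:
            return self.buffered_bytes >= self.ram_budget_bytes
        if self.max_buffered_docs is not None:
            return len(self.buffer) >= self.max_buffered_docs
        return False

    def flush(self):
        if self.buffer:
            self.flushes.append(len(self.buffer))
        self.buffer = []
        self.buffered_bytes = 0


# Six tiny docs followed by six large ones.
docs = ["a" * 10] * 6 + ["b" * 1000] * 6

by_count = BufferedWriter(max_buffered_docs=4)
by_ram = BufferedWriter(ram_budget_bytes=2000)
for d in docs:
    by_count.add_document(d)
    by_ram.add_document(d)
by_count.flush()
by_ram.flush()

print(by_count.flushes)  # docs per flush with a fixed doc count
print(by_ram.flushes)    # docs per flush with a RAM budget
```

With a fixed count every flush holds the same number of docs regardless of their size, while the RAM-based variant buffers many small docs together and flushes sooner once large docs arrive.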

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
