lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <>
Subject [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Date Sat, 23 Jun 2007 06:59:26 GMT


Doron Cohen commented on LUCENE-843:

Mike, I am considering testing the performance of this patch on a somewhat different use case,
a real one I think. After indexing 25M docs of TREC .gov2 (~500GB of docs), I used the index
terms to build a spell-correction index with the contrib spell checker. Docs here are
*very* short: for each index term a document is created, containing some n-grams. The
machine I used has 2 CPUs, but the SpellChecker indexing does not take advantage
of that. Anyhow, 126,684,685 words (== documents) were indexed.
For the docs adding step I had:
    mergeFactor = 100,000
    maxBufferedDocs = 10,000
So no merging took place.
This step took 21 hours, and created 12,685 segments, total size 15 - 20 GB. 
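
As a rough sanity check on these numbers, here is a minimal sketch in plain Java (not Lucene code). It assumes one segment is flushed for every maxBufferedDocs documents and that a merge would only start once mergeFactor segments have accumulated. The class and method names are made up for illustration.

```java
// Back-of-the-envelope estimate only; SegmentEstimate is a hypothetical helper.
final class SegmentEstimate {

    /** Segments written if a flush happens every maxBufferedDocs docs and nothing is merged. */
    static long flushedSegments(long docCount, long maxBufferedDocs) {
        if (maxBufferedDocs < 1) {
            throw new IllegalArgumentException("maxBufferedDocs must be >= 1");
        }
        return (docCount + maxBufferedDocs - 1) / maxBufferedDocs; // ceiling division
    }

    public static void main(String[] args) {
        long docs = 126_684_685L;      // words == documents, from the message
        long maxBufferedDocs = 10_000L;
        long mergeFactor = 100_000L;

        long segments = flushedSegments(docs, maxBufferedDocs);
        System.out.println("estimated segments: " + segments);              // 12669
        System.out.println("below mergeFactor:  " + (segments < mergeFactor)); // true
    }
}
```

The estimate of 12,669 is close to the 12,685 segments actually seen, and it stays far below the mergeFactor of 100,000, which fits with no merging taking place during the adding step.
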
Then I optimized the index with
    mergeFactor = 400
(Larger values were hard on the open files limits.)
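
To get a feel for what mergeFactor = 400 means for the optimize step, here is an idealized sketch in plain Java. It assumes every merge combines up to mergeFactor segments into one and that merging proceeds in rounds until a single segment is left. This is a simplified model, not how optimize() actually schedules its merges, and the names are hypothetical.

```java
// Idealized model only; OptimizePasses is a hypothetical helper, not Lucene code.
final class OptimizePasses {

    /** Rounds needed to get from `segments` down to one when each merge takes up to mergeFactor inputs. */
    static int passes(long segments, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        int rounds = 0;
        while (segments > 1) {
            segments = (segments + mergeFactor - 1) / mergeFactor; // ceiling division
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        // 12,685 segments from the adding step, merged 400 at a time.
        System.out.println("rounds: " + passes(12_685L, 400)); // 2
    }
}
```

Under this model 12,685 segments shrink to 32 after the first round and to 1 after the second, so a merge fan-in of 400 cannot finish in a single round, while each merge still needs up to 400 segments open at once.
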

I thought it would be interesting to see how the new code performs in this scenario. What
do you think?

If you too find this comparison interesting, I have two more questions:
  - what settings do you recommend? 
  - is there any chance for speed-up in optimize()?  I haven't read your 
    new code yet, but at least from some comments here it seems that 
    on-disk merging was not changed. Is this (still) so? I would skip the 
    optimize part if this is not of interest for the comparison. (In fact I am 
    still waiting for my optimize() to complete, but if it is not of interest I 
    will just interrupt it...)


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>                 Key: LUCENE-843
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch,
> LUCENE-843.take4.patch, LUCENE-843.take5.patch, LUCENE-843.take6.patch,
> LUCENE-843.take7.patch, LUCENE-843.take8.patch
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
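
The difference between flushing by document count and flushing by RAM usage matters most for very short documents. The following is a minimal simulation in plain Java, not the patch's code. It assumes each document's buffered size is known in bytes and stands in for its RAM cost, which is an illustrative simplification. All names are hypothetical.

```java
import java.util.Arrays;

// Toy simulation only; FlushPolicySketch is a hypothetical class, not part of Lucene.
final class FlushPolicySketch {

    /** Flushes needed when a flush is triggered after every maxBufferedDocs documents. */
    static int flushesByDocCount(int[] docSizesBytes, int maxBufferedDocs) {
        int flushes = 0;
        int buffered = 0;
        for (int i = 0; i < docSizesBytes.length; i++) {
            buffered++;
            if (buffered >= maxBufferedDocs) {
                flushes++;
                buffered = 0;
            }
        }
        return buffered > 0 ? flushes + 1 : flushes; // final flush for leftovers
    }

    /** Flushes needed when a flush is triggered once buffered bytes reach ramBufferBytes. */
    static int flushesByRam(int[] docSizesBytes, long ramBufferBytes) {
        int flushes = 0;
        long used = 0;
        for (int size : docSizesBytes) {
            used += size;
            if (used >= ramBufferBytes) {
                flushes++;
                used = 0;
            }
        }
        return used > 0 ? flushes + 1 : flushes; // final flush for leftovers
    }

    public static void main(String[] args) {
        int[] tinyDocs = new int[1000];
        Arrays.fill(tinyDocs, 100); // 1,000 very short docs of 100 bytes each

        System.out.println("by doc count: " + flushesByDocCount(tinyDocs, 10));  // 100
        System.out.println("by RAM:       " + flushesByRam(tinyDocs, 16_000L));  // 7
    }
}
```

With a fixed document count, tiny documents trigger a flush long before the buffer is meaningfully full. A byte budget instead lets many more of them accumulate per flush, which means fewer initial segments.
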

