lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes
Date Thu, 24 May 2007 18:57:16 GMT


Michael McCandless commented on LUCENE-888:

> > I plan to add "private int bufferSize" to BufferedIndexInput,
> > defaulting to BUFFER_SIZE. I think then it would just work w/ your
> > LUCENE-430 patch because your patch sets the clone's buffer to null
> > and then when the clone allocates its buffer it will be length
> > bufferSize. I think?
> True. But it would be nice if it was possible to change the buffer size
> after a clone. For example in SegmentTermDocs we could then adjust the
> buffer size of the cloned freqStream according to the document frequency.
> And in my multi-level skipping patch (LUCENE-866) I could also benefit
> from this functionality.

OK, I agree: let's add a BufferedIndexInput.setBufferSize() and also
openInput(path, bufferSize) to Directory base class & to FSDirectory.

> Hmm, in SegmentTermDocs the freq stream is cloned in the ctor. If the
> same instance of SegmentTermDocs is used for different terms, then 
> the same clone is used. So actually it would be nice it was possible to 
> change the buffer size after read has performed.
> > Maybe we do the setBufferSize approach, but, if the buffer already
> > exists, rather than throwing an exception we check if the new size is
> > greater than the old size and if so we grow the buffer? I can code this
> > up. 
> So yes, I think we should implement it this way.

OK I will do this.  Actually, I think we should also allow making the
buffer smaller this way.  Meaning, I will preserve buffer contents
(starting from bufferPosition) as much as is allowed by the smaller
buffer.  This way the method can be called at any time, whether or not
bytes have already been read ("principle of least surprise").
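
To make the resize behavior described above concrete, here is a minimal
sketch in Java. It is not the actual patch: ResizableBufferedInput is a
hypothetical stand-in that reads from an in-memory byte[] instead of a
file, and its field names are assumptions chosen to match the terms used
in this thread (bufferSize, bufferPosition). It only illustrates the
idea of keeping the unread bytes, starting at bufferPosition, up to
whatever the new buffer can hold.

```java
/** Hypothetical stand-in for a buffered input; not Lucene's BufferedIndexInput. */
class ResizableBufferedInput {
    private final byte[] source;     // stands in for the underlying file
    private int bufferSize;          // size to use when (re)allocating the buffer
    private byte[] buffer;           // allocated lazily on the first read
    private long bufferStart = 0;    // position in source of buffer[0]
    private int bufferLength = 0;    // number of valid bytes in buffer
    private int bufferPosition = 0;  // index of the next byte to read

    ResizableBufferedInput(byte[] source, int bufferSize) {
        if (bufferSize <= 0) throw new IllegalArgumentException("bufferSize must be > 0");
        this.source = source;
        this.bufferSize = bufferSize;
    }

    byte readByte() {
        if (bufferPosition >= bufferLength) refill();
        return buffer[bufferPosition++];
    }

    private void refill() {
        long start = bufferStart + bufferPosition;
        int len = (int) Math.min(bufferSize, source.length - start);
        if (len <= 0) throw new IllegalStateException("read past EOF");
        if (buffer == null) buffer = new byte[bufferSize];
        System.arraycopy(source, (int) start, buffer, 0, len);
        bufferStart = start;
        bufferLength = len;
        bufferPosition = 0;
    }

    /** Grow or shrink the buffer, keeping as many unread bytes as fit. */
    void setBufferSize(int newSize) {
        if (newSize <= 0) throw new IllegalArgumentException("newSize must be > 0");
        if (newSize == bufferSize) return;
        bufferSize = newSize;
        if (buffer == null) return;  // nothing read yet: next refill allocates newSize

        int unread = bufferLength - bufferPosition;
        int keep = Math.min(unread, newSize);  // a smaller buffer may drop the tail
        byte[] resized = new byte[newSize];
        System.arraycopy(buffer, bufferPosition, resized, 0, keep);

        bufferStart += bufferPosition;  // the new buffer begins at the read position
        bufferPosition = 0;
        bufferLength = keep;
        buffer = resized;
    }

    long getFilePointer() {
        return bufferStart + bufferPosition;
    }
}
```

In this sketch, bytes dropped by a shrink are not lost: the next refill
starts at bufferStart + bufferPosition, so they are simply read again
from the source. The logical read position is the same before and after
the call, which is why the method can be used at any point.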

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>                 Key: LUCENE-888
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
>
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
>
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
>
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
>
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
>
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
>
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
>
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
