lucene-dev mailing list archives

From "Marvin Humphrey (JIRA)" <>
Subject [jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes
Date Sat, 26 May 2007 03:13:16 GMT


Marvin Humphrey commented on LUCENE-888:

I have some auxiliary data points to report after experimenting with buffer
size in KinoSearch (KS) today on three different systems: OS X 10.4.9,
FreeBSD 5.3, and an old RedHat 9 box.

The FS i/o classes in KinoSearch use a FILE* and
fopen/fwrite/fread/fseek/ftell, rather than file descriptors and the POSIX
family of functions.  Theoretically, this is wasteful because FILE* stream i/o
is itself buffered, so data gets buffered twice: once in KS and once in stdio.
I've meant to change that for some time.  However, when I've used
setvbuf(self->fhandle, NULL, _IONBF, 0) to eliminate the buffer for the
underlying FILE* object, performance tanks -- indexing time doubles.  I still
don't understand exactly why, but I know a little more now.
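
For reference, a minimal sketch of the setvbuf call described above.  The
helper name disable_stdio_buffer is hypothetical and is not taken from the
KinoSearch source; only setvbuf and _IONBF are standard C.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not KinoSearch code: switch an open stream to
 * unbuffered mode.  The C standard requires setvbuf to be called before
 * any other operation on the stream.  Returns 0 on success. */
int
disable_stdio_buffer(FILE *fhandle)
{
    assert(fhandle != NULL);
    /* setvbuf takes four arguments.  With _IONBF the buffer pointer and
     * the size are ignored, so NULL and 0 are the conventional values. */
    return setvbuf(fhandle, NULL, _IONBF, 0);
}
```

Calling this right after fopen() leaves the application's own buffer as the
only user-space buffer between the caller and the kernel.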

  * Swapping out the FILE* for a descriptor and switching all the I/O calls to
    POSIX variants has no measurable impact on any of these systems.

  * Changing the KS buffer size from 1024 to 4096 has no measurable impact on
    any of these systems.

  * Using setvbuf to eliminate buffering in write mode turns out to have no
    impact on indexing performance.  Only killing off the read-mode FILE*
    buffer causes the problem.
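
The descriptor-based variant from the first bullet might look roughly like
the sketch below: a reader that keeps one application-level buffer and
refills it with read().  The FdReader type, the function names, and the
KS_BUF_SIZE constant are all assumptions made for illustration (only the
1024-byte size comes from the text above), and error handling such as
retrying on EINTR is left out.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define KS_BUF_SIZE 1024   /* assumed: the 1024-byte size mentioned above */

/* Hypothetical reader: the application buffer is the only user-space buffer. */
typedef struct {
    int    fd;                 /* POSIX descriptor instead of a FILE* */
    char   buf[KS_BUF_SIZE];
    size_t pos;                /* index of the next unread byte in buf */
    size_t len;                /* number of valid bytes in buf */
} FdReader;

/* Open `path` read-only.  Returns 0 on success, -1 on failure. */
int
fdreader_open(FdReader *r, const char *path)
{
    assert(r != NULL && path != NULL);
    r->fd  = open(path, O_RDONLY);
    r->pos = 0;
    r->len = 0;
    return r->fd < 0 ? -1 : 0;
}

/* Refill with one read().  Returns bytes obtained, 0 at EOF, -1 on error. */
static ssize_t
fdreader_refill(FdReader *r)
{
    ssize_t got = read(r->fd, r->buf, sizeof r->buf);
    r->pos = 0;
    r->len = got > 0 ? (size_t)got : 0;
    return got;
}

/* Return the next byte as 0..255, or -1 at EOF or on error. */
int
fdreader_read_byte(FdReader *r)
{
    if (r->pos == r->len && fdreader_refill(r) <= 0) {
        return -1;
    }
    return (unsigned char)r->buf[r->pos++];
}

/* Seek: discard buffered bytes and reposition the descriptor. */
off_t
fdreader_seek(FdReader *r, off_t target)
{
    r->pos = r->len = 0;
    return lseek(r->fd, target, SEEK_SET);
}

/* Tell: the descriptor offset minus the bytes still unread in buf. */
off_t
fdreader_tell(FdReader *r)
{
    off_t fd_pos = lseek(r->fd, 0, SEEK_CUR);
    return fd_pos < 0 ? fd_pos : fd_pos - (off_t)(r->len - r->pos);
}
```

In this sketch every seek throws away the buffer and every tell costs an
lseek() call, which is the simplest correct design rather than the fastest
one.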

So, it seems that the only change I can make moves the numbers in the wrong
direction.

The results are somewhat puzzling because I would ordinarily have blamed
sub-optimal flush/refill scheduling in my app for the degraded performance
with setvbuf() on read mode.  However, the POSIX i/o calls are unbuffered and
showed no slowdown, so that's not it.  My best guess is that disabling
buffering for read mode disables an fseek/ftell optimization.
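
One way to probe that guess is to time a seek-heavy read pattern on the
same file with the stdio buffer left alone and then disabled.  The sketch
below is only an illustration: time_seek_reads, the 16-byte record size,
and the stepped access pattern are assumptions, not KinoSearch's actual
access pattern.  How a given C library treats its buffer across fseek is
implementation-specific, so the probe measures rather than explains.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical probe, not KinoSearch code.  Reads `count` 16-byte records
 * at stepped offsets, calling fseek and ftell around every fread, with the
 * stdio buffer either left alone or disabled.  Returns the CPU seconds
 * reported by clock(), or -1.0 on failure. */
double
time_seek_reads(const char *path, int unbuffered, long file_size, long count)
{
    char record[16];
    assert(path != NULL && file_size >= (long)sizeof record);

    FILE *fh = fopen(path, "rb");
    if (fh == NULL) {
        return -1.0;
    }
    /* setvbuf must come before any other operation on the stream. */
    if (unbuffered && setvbuf(fh, NULL, _IONBF, 0) != 0) {
        fclose(fh);
        return -1.0;
    }

    long span = file_size - (long)sizeof record + 1;
    clock_t start = clock();
    for (long i = 0; i < count; i++) {
        long offset = (i * 16) % span;   /* short forward hops, then wrap */
        if (fseek(fh, offset, SEEK_SET) != 0
            || ftell(fh) != offset
            || fread(record, 1, sizeof record, fh) != sizeof record) {
            fclose(fh);
            return -1.0;
        }
    }
    clock_t end = clock();

    fclose(fh);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Comparing time_seek_reads(path, 0, size, n) against
time_seek_reads(path, 1, size, n) on each of the three systems would show
whether seek-heavy reads alone reproduce the slowdown.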

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>                 Key: LUCENE-888
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-888.patch, LUCENE-888.take2.patch
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
>
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
>
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
>
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
>
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
>
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
>
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
>
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

