lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes
Date Thu, 24 May 2007 16:57:16 GMT


Michael Busch commented on LUCENE-888:

> I wonder if we should just add a ctor to BufferedIndexInput that takes
> the bufferSize? This would avoid the surprising API caveat you
> describe above. The problem is, then all classes (SegmentTermDocs,
> SegmentTermPositions, FieldsReader, etc.) that open an IndexInput
> would also have to have ctors to change buffer sizes. Even if we do
> setBufferSize instead of new ctor we have some cases (eg at least
> SegmentTermEnum) where bytes are read during construction so it's too
> late for caller to then change buffer size. Hmmm. Not clear how to
> do this cleanly...

Yeah, I was thinking about the ctor approach as well. Actually,
BufferedIndexInput does not have a public ctor so far; it's created by
calling Directory.openInput(String fileName). Adding a new ctor would
mean an API change, so subclasses wouldn't compile anymore without
changes. What we might want to do instead is add a new method
openInput(String fileName, int bufferSize) to Directory which calls
the existing openInput(String fileName) by default, so subclasses of
Directory would ignore the bufferSize parameter by default. Then we
can change FSDirectory to override openInput(String, int):

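  public IndexInput openInput(String name, int bufferSize)
      throws IOException {
    FSIndexInput input = new FSIndexInput(new File(directory, name));
    // Apply the requested size before any bytes are read; this assumes
    // the setBufferSize() method discussed in this issue.
    input.setBufferSize(bufferSize);
    return input;
  }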

This should solve the problems you mentioned, like the one in
SegmentTermEnum, and we don't have to support setBufferSize() after a
read has been performed. It also has the advantage that we save an
instanceof and cast from IndexInput to BufferedIndexInput before
setBufferSize() can be called.
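
To make the delegation idea concrete, here is a minimal sketch. It does
not use the real Lucene classes: SketchDirectory, SketchInput,
LegacyDirectory and SketchFSDirectory are hypothetical stand-ins, the
file name is made up, and IO is left out entirely. It only shows that a
subclass that knows nothing about the new overload keeps its old
behavior, while a subclass that overrides it honors the requested size.

```java
// Hypothetical, simplified stand-ins; these are NOT the real Lucene classes.
abstract class SketchInput {
    abstract int bufferSize();
}

class SketchBufferedInput extends SketchInput {
    static final int DEFAULT_BUFFER_SIZE = 1024; // the 1 KB default from the issue
    private final int bufferSize;

    SketchBufferedInput(int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        this.bufferSize = bufferSize;
    }

    @Override
    int bufferSize() {
        return bufferSize;
    }
}

abstract class SketchDirectory {
    abstract SketchInput openInput(String name);

    // New overload: by default the size hint is ignored, so existing
    // subclasses keep compiling and behave exactly as before.
    SketchInput openInput(String name, int bufferSize) {
        return openInput(name);
    }
}

// A subclass written before the overload existed.
class LegacyDirectory extends SketchDirectory {
    @Override
    SketchInput openInput(String name) {
        return new SketchBufferedInput(SketchBufferedInput.DEFAULT_BUFFER_SIZE);
    }
}

// A subclass that honors the size hint, like the FSDirectory change above.
class SketchFSDirectory extends SketchDirectory {
    @Override
    SketchInput openInput(String name) {
        return openInput(name, SketchBufferedInput.DEFAULT_BUFFER_SIZE);
    }

    @Override
    SketchInput openInput(String name, int bufferSize) {
        // The size is fixed at construction, before any read can happen.
        return new SketchBufferedInput(bufferSize);
    }
}

class OpenInputSketch {
    public static void main(String[] args) {
        // prints 1024: the legacy subclass ignores the hint
        System.out.println(new LegacyDirectory().openInput("_0.frq", 8192).bufferSize());
        // prints 8192: the overriding subclass uses it
        System.out.println(new SketchFSDirectory().openInput("_0.frq", 8192).bufferSize());
    }
}
```

Because the size is passed in when the input is opened, a caller such as
SegmentTermEnum never gets a chance to read bytes with the wrong buffer.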

After a clone however, we would still have to cast to 
BufferedIndexInput before setBufferSize() can be called.
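
The clone case could look roughly like the sketch below. Again, these
are hypothetical stand-in classes (CloneableInput,
CloneableBufferedInput, UnbufferedInput), not the real IndexInput
hierarchy, and cloneWithBufferSize is an illustrative helper name. The
point is only that clone() hands back the base type, so the caller has
to check and cast before it can resize the buffer.

```java
// Hypothetical stand-ins; NOT the real IndexInput / BufferedIndexInput.
abstract class CloneableInput implements Cloneable {
    @Override
    public CloneableInput clone() {
        try {
            return (CloneableInput) super.clone();
        } catch (CloneNotSupportedException e) {
            // Cannot happen: this class implements Cloneable.
            throw new AssertionError(e);
        }
    }
}

class CloneableBufferedInput extends CloneableInput {
    private int bufferSize = 1024; // 1 KB default, as in the issue

    void setBufferSize(int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        this.bufferSize = bufferSize;
    }

    int getBufferSize() {
        return bufferSize;
    }
}

// An input type without a buffer, to show why the instanceof check is needed.
class UnbufferedInput extends CloneableInput {
}

class CloneSketch {
    // Clone the input and, if the clone is buffered, resize its buffer
    // before the caller performs any read on it.
    static CloneableInput cloneWithBufferSize(CloneableInput original, int bufferSize) {
        CloneableInput copy = original.clone();
        if (copy instanceof CloneableBufferedInput) {
            ((CloneableBufferedInput) copy).setBufferSize(bufferSize);
        }
        return copy;
    }
}
```

The original input keeps its own buffer size; only the clone is changed.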

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>                 Key: LUCENE-888
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
>
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
>
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
>
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
>
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
>
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
>
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
>
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.
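
For reference, the 11% figure in the issue description follows from the
two build times it reports (622 sec at 1 KB, 554 sec at 8 KB). A small
sketch of the arithmetic, with GainSketch and relativeGainPercent as
illustrative names only:

```java
import java.util.Locale;

class GainSketch {
    // Percentage of the baseline time saved by the improved run.
    static double relativeGainPercent(double baselineSec, double improvedSec) {
        return 100.0 * (baselineSec - improvedSec) / baselineSec;
    }

    public static void main(String[] args) {
        // (622 - 554) / 622 = 68 / 622, which prints 10.9%
        System.out.printf(Locale.ROOT, "%.1f%%%n", relativeGainPercent(622, 554));
    }
}
```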

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
