lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes
Date Thu, 24 May 2007 16:09:16 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498696 ]

Michael McCandless commented on LUCENE-888:
-------------------------------------------

> > I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> > at 1024, at least for now.  During searching there can be quite a few
> > of this class instantiated, and likely a larger buffer size for the
> > freq/prox streams could actually hurt search performance for those
> > searches that use skipping.
> 
> I submitted a patch for LUCENE-430 which avoids copying the buffer when
> a BufferedIndexInput is cloned. With this patch we could also add a 
> method setBufferSize(int) to BufferedIndexInput. This method has to
> be called then right after the input has been created or cloned and
> before the first read is performed (the first read operation allocates
> the buffer). If called later it wouldn't have any effect. This would
> allow us to adjust the buffer size dynamically, e.g. use large buffers
> for segment merges and stored fields, but smaller ones for freq/prox 
> streams, maybe dependent on the document frequency. 
> What do you think?

I like that idea!

I am actually seeing that increased buffer sizes for
BufferedIndexInput help performance of indexing as well (up to ~5%
just changing this buffer), so I think we do want to increase this but
only for merging.

I wonder if we should just add a ctor to BufferedIndexInput that takes
the bufferSize?  This would avoid the surprising API caveat you
describe above.  The problem is, then all classes (SegmentTermDocs,
SegmentTermPositions, FieldsReader, etc.) that open an IndexInput
would also have to have ctors to change buffer sizes.  Even if we do
setBufferSize instead of a new ctor, we have some cases (e.g. at least
SegmentTermEnum) where bytes are read during construction, so it's too
late for the caller to then change the buffer size.  Hmmm.  Not clear
how to do this cleanly...

Maybe we do the setBufferSize approach, but, if the buffer already
exists, rather than throwing an exception we check if the new size is
greater than the old size and if so we grow the buffer?  I can code this
up.
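
To make the proposal concrete, here is a minimal sketch of the "grow if
the buffer already exists" behavior.  It is not Lucene's
BufferedIndexInput: the class GrowableBufferedInput, its fields and the
java.io.InputStream it wraps are all hypothetical stand-ins, and
ignoring a request for a smaller size is my assumption about what
"grow only" should mean.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a buffered input; NOT Lucene's BufferedIndexInput. */
class GrowableBufferedInput {
    static final int DEFAULT_BUFFER_SIZE = 1024;

    private final InputStream in;
    private int bufferSize = DEFAULT_BUFFER_SIZE;
    private byte[] buffer;           // allocated lazily by the first read
    private int bufferLength = 0;    // number of valid bytes in buffer
    private int bufferPosition = 0;  // next byte to hand out

    GrowableBufferedInput(InputStream in) {
        this.in = in;
    }

    /** Sets the size before the first read; afterwards it can only grow the buffer. */
    void setBufferSize(int newSize) {
        if (newSize <= 0) {
            throw new IllegalArgumentException("buffer size must be positive: " + newSize);
        }
        if (buffer == null) {
            // Nothing allocated yet: just remember the requested size.
            bufferSize = newSize;
        } else if (newSize > buffer.length) {
            // Already reading: grow, keeping the bytes not yet consumed.
            byte[] grown = new byte[newSize];
            int remaining = bufferLength - bufferPosition;
            System.arraycopy(buffer, bufferPosition, grown, 0, remaining);
            buffer = grown;
            bufferLength = remaining;
            bufferPosition = 0;
            bufferSize = newSize;
        }
        // Assumption: a smaller size on a live buffer is silently ignored.
    }

    int getBufferSize() {
        return bufferSize;
    }

    /** Returns the next byte as 0..255, or -1 at end of input. */
    int readByte() throws IOException {
        if (bufferPosition >= bufferLength && !refill()) {
            return -1;
        }
        return buffer[bufferPosition++] & 0xFF;
    }

    private boolean refill() throws IOException {
        if (buffer == null) {
            buffer = new byte[bufferSize];  // first read allocates the buffer
        }
        int n = in.read(buffer, 0, buffer.length);
        bufferPosition = 0;
        bufferLength = Math.max(n, 0);
        return n > 0;
    }
}
```

With something like this, a class that reads bytes during construction
(the SegmentTermEnum case) would still let the caller raise the buffer
size afterwards; only the bytes already buffered are copied.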

> > The CompoundFileWriter buffer is created only briefly, so I think we
> > can use a fairly large (32 KB?) buffer there.  And there should not be
> > too many BufferedIndexOutputs alive at once so I think a large-ish
> > buffer (16 KB?) should be OK.
> 
> I'm wondering how much performance improves if you increase the buffer 
> size beyond the file system's page size? Does it make a big difference
> if you use 8 KB, 16 KB or 32 KB? If the answer is yes, then I think
> the numbers you propose are good.

I'm now testing different sizes of each of these three buffers and
will post the results.
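
For anyone who wants to try the buffer-size question outside Lucene,
here is a rough sketch of a timing loop.  It uses the JDK's
BufferedOutputStream rather than BufferedIndexOutput or
CompoundFileWriter, so it only illustrates the shape of the
experiment; the class name BufferSizeTiming, the list of sizes and the
byte-at-a-time write pattern are my assumptions, and results will
depend heavily on the OS cache and the IO system.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical micro-benchmark; uses JDK streams, not Lucene classes. */
class BufferSizeTiming {

    /** Writes totalBytes one byte at a time through a buffer of the given size. */
    static long timeWrite(Path file, int bufferSize, long totalBytes) throws IOException {
        long start = System.nanoTime();
        try (OutputStream out =
                 new BufferedOutputStream(Files.newOutputStream(file), bufferSize)) {
            for (long i = 0; i < totalBytes; i++) {
                out.write((int) (i & 0xFF));
            }
        }  // close() flushes whatever is left in the buffer
        return System.nanoTime() - start;
    }

    /** Times the same write at 1 KB, 8 KB, 16 KB and 32 KB and prints each result. */
    static void run(long totalBytes) throws IOException {
        int[] sizes = {1024, 8 * 1024, 16 * 1024, 32 * 1024};
        for (int size : sizes) {
            Path file = Files.createTempFile("bufsize", ".bin");
            try {
                long nanos = timeWrite(file, size, totalBytes);
                System.out.printf("%6d byte buffer: %d ms%n", size, nanos / 1_000_000);
            } finally {
                Files.deleteIfExists(file);
            }
        }
    }
}
```

A single pass like this is noisy; repeating each size several times
and discarding the first run would give more trustworthy numbers.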


> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.
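
As a check on the arithmetic in the description above: the 11% figure
is the reduction in wall-clock time from 622 sec to 554 sec.  The
sketch below (class and method names are hypothetical) computes that,
along with the same result expressed as a throughput gain, which comes
out slightly higher.

```java
/** Hypothetical helper for the two ways of stating the 622 sec -> 554 sec result. */
class IndexingGain {

    /** Percent reduction in elapsed time: 100 * (before - after) / before. */
    static double timeReductionPercent(double beforeSec, double afterSec) {
        return 100.0 * (beforeSec - afterSec) / beforeSec;
    }

    /** Percent increase in docs/sec for the same workload: 100 * (before / after - 1). */
    static double throughputGainPercent(double beforeSec, double afterSec) {
        return 100.0 * (beforeSec / afterSec - 1.0);
    }
}
```

For 622 and 554 seconds, the time reduction is about 10.9% and the
throughput gain is about 12.3%.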

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



