lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes
Date Fri, 25 May 2007 10:10:16 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499018 ]

Michael McCandless commented on LUCENE-888:
-------------------------------------------

OK, I ran two sets of tests.  The first ran only on Mac OS X, to see
how performance changes with buffer size.  The second also ran on
Debian Linux & Windows XP Pro.

Overall, indexing is 10-18% faster with the larger buffers.


FIRST TEST

I increased the buffer size separately for each of BufferedIndexInput,
BufferedIndexOutput and CompoundFileWriter, leaving the other two at
the trunk default.  Each test was run once on Mac OS X:

  BufferedIndexInput

      1 K   622 sec (current trunk)
      4 K   607 sec
      8 K   606 sec
     16 K   598 sec
     32 K   606 sec
     64 K   589 sec
    128 K   601 sec

  CompoundFileWriter

      1 K   622 sec (current trunk)
      4 K   599 sec
      8 K   591 sec
     16 K   578 sec
     32 K   583 sec
     64 K   580 sec

  BufferedIndexOutput

      1 K   622 sec (current trunk)
      4 K   588 sec
      8 K   576 sec
     16 K   551 sec
     32 K   566 sec
     64 K   555 sec
    128 K   543 sec
    256 K   534 sec
    512 K   564 sec

Comments:

  * The results are fairly noisy, but performance does generally get
    better w/ larger buffers.

  * BufferedIndexOutput in particular seems to benefit from very large
    buffers; the other two show a smaller but still significant
    effect.
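
Because single runs like these are noisy, a best-of-N timing loop is
one way to probe the buffer-size effect.  The sketch below is only an
illustration: it uses the JDK's java.io.BufferedOutputStream as a
stand-in, not Lucene's BufferedIndexOutput, and the class name, the
16 MB payload and the list of buffer sizes are all assumptions chosen
for the example.  Its timings say nothing about the Lucene numbers
above.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Lucene: times byte-at-a-time writes
// through a plain JDK buffered stream at a given buffer size.
class BufferSizeTiming {

    // Writes totalBytes single bytes through a buffer of bufferSize bytes
    // and returns the elapsed wall-clock time in nanoseconds.
    static long timeWriteNanos(Path file, int bufferSize, long totalBytes) throws IOException {
        long start = System.nanoTime();
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(file), bufferSize)) {
            for (long i = 0; i < totalBytes; i++) {
                out.write((int) (i & 0xFF));
            }
        }
        return System.nanoTime() - start;
    }

    // Repeats the write `runs` times and keeps the fastest run, which
    // filters out some of the run-to-run noise.
    static long bestOfNanos(int runs, int bufferSize, long totalBytes) {
        try {
            Path file = Files.createTempFile("bufsize", ".bin");
            try {
                long best = Long.MAX_VALUE;
                for (int r = 0; r < runs; r++) {
                    best = Math.min(best, timeWriteNanos(file, bufferSize, totalBytes));
                }
                return best;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long payload = 16L * 1024 * 1024; // arbitrary 16 MB payload
        for (int kb : new int[] {1, 4, 16, 64}) {
            long ns = bestOfNanos(4, kb * 1024, payload);
            System.out.printf("%3d K  %d ms%n", kb, ns / 1_000_000);
        }
    }
}
```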

Given this, I picked a 16 K buffer for BufferedIndexOutput, a 16 K
buffer for CompoundFileWriter and a 4 K buffer for BufferedIndexInput.
I think a larger buffer for BufferedIndexInput would be faster still,
but even when merging quite a few of these are created (mergeFactor * N
where N = number of separate index files), so I kept that buffer
small.
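
To make the mergeFactor * N concern concrete, here is a rough sketch of
the arithmetic.  The mergeFactor of 10 comes from the issue description
quoted below; the value N = 8 files per segment is a made-up figure for
illustration only, as are the class and method names.

```java
// Hypothetical helper, not part of Lucene: estimates the memory held by
// BufferedIndexInput buffers that are open at once during a merge.
class MergeBufferFootprint {

    // mergeFactor segments are merged at once, each with filesPerSegment
    // open inputs, and every input holds one buffer of bufferSizeBytes.
    static long inputBufferBytes(int mergeFactor, int filesPerSegment, int bufferSizeBytes) {
        return (long) mergeFactor * filesPerSegment * bufferSizeBytes;
    }

    public static void main(String[] args) {
        int mergeFactor = 10;     // from the issue description
        int filesPerSegment = 8;  // assumed value of N, for illustration only
        for (int kb : new int[] {1, 4, 16, 64}) {
            long total = inputBufferBytes(mergeFactor, filesPerSegment, kb * 1024);
            System.out.printf("%2d K buffers -> %d KB total%n", kb, total / 1024);
        }
    }
}
```

Under these assumptions the total grows linearly with the buffer size,
which is why a buffer that is cheap for one stream can add up across
many open inputs.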

Then, I re-tested the baseline (trunk) & these buffer sizes across
platforms (below):



SECOND TEST

Baseline (trunk) = 1 K buffers for all 3.  New = 16 K for
BufferedIndexOutput, 16 K for CompoundFileWriter and 4 K for
BufferedIndexInput.

I ran each test 4 times & took the best time:

Quad core Mac OS X on 4-drive RAID 0
  baseline  622 sec
  new       527 sec
  -> 15% faster

Dual core Debian Linux (2.6.18 kernel) on 6 drive RAID 5
  baseline  708 sec
  new       635 sec
  -> 10% faster
  
Windows XP Pro laptop, single drive
  baseline  1604 sec
  new       1308 sec
  -> 18% faster

Net/net, that is a 10-18% performance gain overall.  Interestingly,
the system with the "weakest" IO system (one drive on Windows XP vs
RAID 0/5 on the others) shows the largest gain.
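
For reference, the percentages above are the reduction in wall-clock
time relative to the baseline.  A minimal sketch of that calculation,
using the best-of-4 times reported in this message (the class and
method names are made up for the example):

```java
// Hypothetical helper: recomputes the "% faster" figures from the
// baseline and new times listed in the second test.
class IndexingGain {

    // Percent reduction in elapsed time relative to the baseline run.
    static double percentFaster(double baselineSec, double newSec) {
        return 100.0 * (baselineSec - newSec) / baselineSec;
    }

    public static void main(String[] args) {
        String[] platforms = {"Mac OS X", "Debian Linux", "Windows XP Pro"};
        double[][] times = {{622, 527}, {708, 635}, {1604, 1308}}; // {baseline, new} in seconds
        for (int i = 0; i < platforms.length; i++) {
            double gain = percentFaster(times[i][0], times[i][1]);
            System.out.printf("%-15s %.1f%% faster%n", platforms[i], gain);
        }
    }
}
```

Rounded to whole percentages, this gives the 15%, 10% and 18% figures
listed for the three platforms.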

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

