lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes
Date Fri, 25 May 2007 21:01:18 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499226
] 

Michael McCandless commented on LUCENE-888:
-------------------------------------------


> we did some time ago a few tests and simply concluded, it boils down to what Doug said,
"It does incur extra system calls, but uses less memory, which is a tradeoff."
>
> in *our* setup 4k was kind of magic number , ca 5-8% faster. I guess it is actually a
compromise between time spent in extra os calls vs probability of reading more than you will
really use (waste time on it). Why 4k number happens often to be good compromise is probably
the difference in speed of buffer copy for 4k vs 1k being negligible compared to time spent
on system calls.
> 
> The only conclusion we came up to is that you have to measure it and find good compromise.
Our case is a bit atypical, short documents (1G index, 60Mio docs) and queries with a lot
of terms (80-200), Win 2003 Server, single disk.
>
> And I do not remember was it before or after we started using MMAP, so no idea really
if this affects MMAP setup, guess not. 

Interesting!  Do you remember if your 5-8% gain was for searching or
indexing?  This issue is focusing on buffer size impact on indexing
performance and LUCENE-893 is focusing on search performance.


> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-888.patch
>
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message