lucene-dev mailing list archives

From "Sean Bridges (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3932) Improve load time of .tii files
Date Thu, 29 Mar 2012 17:24:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241405#comment-13241405 ]

Sean Bridges commented on LUCENE-3932:
--------------------------------------

I ran the tests on my local machine with an SSD, and loading is definitely CPU-bound.

Our index has 600,000,000 terms. It is an index of 10,000,000 emails with their associated
attachments. We generate a lot of garbage terms when parsing: things like timestamps,
malformed attachments that parse badly, etc.

After the change, the biggest waste of time is converting the terms from UTF-8 to UTF-16 when
reading from the .tii file, and then back to UTF-8 when writing to the in-memory store.
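
To illustrate the round trip described above, here is a minimal sketch using only standard JDK classes. It is not Lucene's actual reading code: the class and method names are hypothetical, and StandardCharsets assumes Java 7 or later. It only shows that decoding UTF-8 into a String (UTF-16 internally) and encoding it back yields the same bytes a direct copy would, so the conversion adds work without changing the stored data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration, not Lucene code.
class Utf8RoundTrip {

    // Decode UTF-8 bytes into a String (UTF-16 internally), then encode back to UTF-8.
    public static byte[] roundTrip(byte[] utf8) {
        String utf16 = new String(utf8, StandardCharsets.UTF_8);
        return utf16.getBytes(StandardCharsets.UTF_8);
    }

    // Copy the UTF-8 bytes directly, with no intermediate String.
    public static byte[] directCopy(byte[] utf8) {
        return utf8.clone();
    }

    public static void main(String[] args) {
        byte[] term = "attachment-2012-03-29".getBytes(StandardCharsets.UTF_8);
        // For well-formed UTF-8 input, both paths produce identical bytes.
        System.out.println(Arrays.equals(roundTrip(term), directCopy(term)));
    }
}
```

This equivalence holds for well-formed UTF-8 input only; malformed byte sequences are replaced during decoding, so the two paths can differ there.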
                
> Improve load time of .tii files
> -------------------------------
>
>                 Key: LUCENE-3932
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3932
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.5
>         Environment: Linux
>            Reporter: Sean Bridges
>
> We have a large 50 GB index which is optimized as one segment, with a 66 MB .tii file.
> This index has no norms and no field cache.
> It takes about 5 seconds to load this index. Profiling reveals that 60% of the time is
> spent in GrowableWriter.set(index, value), and most of the time in set(...) is spent resizing
> the PackedInts.Mutable field current.
> In the constructor for TermInfosReaderIndex, you initialize the writer with the line,
> {quote}GrowableWriter indexToTerms = new GrowableWriter(4, indexSize, false);{quote}
> For our index, using four as the bit estimate results in 27 resizes.
> The last value in indexToTerms is going to be ~ tiiFileLength, and if instead you use,
> {quote}int bitEstimate = (int) Math.ceil(Math.log10(tiiFileLength) / Math.log10(2));
> GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);{quote}
> Load time improves to ~ 2 seconds.
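
As a sanity check on the bit estimate in the description, the sketch below computes the number of bits needed for a value the size of the .tii file in two ways: the floating-point formula quoted above, and an integer-only alternative based on Long.numberOfLeadingZeros. The class and method names are hypothetical, and the file length assumes "66 MB" means 66 * 1024 * 1024 bytes. It is a sketch, not the patch attached to the issue.

```java
// Hypothetical illustration, not Lucene code.
class BitEstimate {

    // Bits needed to store maxValue as an unsigned integer (at least 1).
    public static int bitsRequired(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("maxValue must be non-negative: " + maxValue);
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    // The floating-point estimate quoted in the issue description.
    public static int logEstimate(long maxValue) {
        return (int) Math.ceil(Math.log10(maxValue) / Math.log10(2));
    }

    public static void main(String[] args) {
        // Assumption: a 66 MB .tii file is 66 * 1024 * 1024 bytes.
        long tiiFileLength = 66L * 1024 * 1024;
        System.out.println(bitsRequired(tiiFileLength)); // prints 27
        System.out.println(logEstimate(tiiFileLength));  // prints 27
    }
}
```

Both give 27 bits for this file size. The integer version avoids floating-point rounding, and the two can differ by one when the value is an exact power of two, which would cost at most one extra resize.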
