lucene-dev mailing list archives

From "Michael McCandless (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.
Date Thu, 20 Oct 2011 18:10:11 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131857#comment-13131857 ]

Michael McCandless commented on LUCENE-2205:
--------------------------------------------

bq. But I may not understand how the GrowableWriter will help. I understand that it allows
me to append more values as I go, but when I start writing them I have no idea what bit size
to choose for the packing. Can you explain?

It actually grows in "both" dimensions -- it tracks the max value so far and internally will
"upgrade" to a bigger bits-per-value as needed.  So, e.g., you could start with a small bitsPerValue
(maybe 4 or something) and then let it grow itself.
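
To illustrate growing in both dimensions, here is a minimal Java sketch. It is not Lucene's
GrowableWriter: the class name GrowablePackedSketch and all of its methods are hypothetical,
it assumes non-negative values (at most 63 bits), and it repacks eagerly on every widening.

```java
// Hypothetical illustration only; this is NOT Lucene's GrowableWriter.
class GrowablePackedSketch {
    private long[] blocks;     // packed storage, bitsPerValue bits per slot
    private int bitsPerValue;  // current width of every slot
    private int capacity;      // number of slots currently allocated

    GrowablePackedSketch(int startBitsPerValue, int initialCapacity) {
        if (startBitsPerValue < 1 || startBitsPerValue > 63 || initialCapacity < 1) {
            throw new IllegalArgumentException("need 1..63 bits and capacity >= 1");
        }
        bitsPerValue = startBitsPerValue;
        capacity = initialCapacity;
        blocks = new long[blocksFor(capacity, bitsPerValue)];
    }

    int bitsPerValue() { return bitsPerValue; }

    long get(int index) {
        if (index < 0 || index >= capacity) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return read(blocks, bitsPerValue, index);
    }

    void set(int index, long value) {
        if (index < 0 || value < 0) {
            throw new IllegalArgumentException("index and value must be non-negative");
        }
        // Dimension 1: widen the slots if this value needs more bits than we have.
        int needed = Math.max(1, 64 - Long.numberOfLeadingZeros(value));
        int newBits = Math.max(bitsPerValue, needed);
        // Dimension 2: add slots if the index is past the current capacity.
        int newCapacity = capacity;
        while (index >= newCapacity) newCapacity *= 2;
        if (newBits != bitsPerValue || newCapacity != capacity) {
            repack(newCapacity, newBits);
        }
        write(blocks, bitsPerValue, index, value);
    }

    // Copy every existing value into a freshly sized array.
    private void repack(int newCapacity, int newBits) {
        long[] next = new long[blocksFor(newCapacity, newBits)];
        for (int i = 0; i < capacity; i++) {
            write(next, newBits, i, read(blocks, bitsPerValue, i));
        }
        blocks = next;
        capacity = newCapacity;
        bitsPerValue = newBits;
    }

    private static int blocksFor(int slots, int bits) {
        return (int) (((long) slots * bits + 63) / 64);
    }

    private static long read(long[] b, int bits, int index) {
        long bitPos = (long) index * bits;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = b[block] >>> shift;
        if (shift + bits > 64) v |= b[block + 1] << (64 - shift); // value spans two longs
        return v & ((1L << bits) - 1);
    }

    private static void write(long[] b, int bits, int index, long value) {
        long bitPos = (long) index * bits;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long mask = (1L << bits) - 1;
        b[block] = (b[block] & ~(mask << shift)) | (value << shift);
        if (shift + bits > 64) { // spill the high bits into the next long
            int low = 64 - shift;
            b[block + 1] = (b[block + 1] & ~(mask >>> low)) | (value >>> low);
        }
    }

    public static void main(String[] args) {
        GrowablePackedSketch values = new GrowablePackedSketch(4, 2);
        values.set(0, 7);        // fits in the initial 4 bits
        values.set(1, 300);      // needs 9 bits, so every slot is widened
        values.set(5, 1L << 40); // past capacity and needs 41 bits
        System.out.println(values.get(1) + " bits=" + values.bitsPerValue());
    }
}
```

The caller never picks a final bit width up front; the cost is an occasional full repack
when a larger value or index arrives.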

bq. Also if you would like to finish this patch that would be fine with me. Let me know if you
want me to continue or if you are going to work on it. Thanks!

OK thanks Aaron... I'll take a crack at the next iteration.
                
> Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2205
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>         Environment: Java5
>            Reporter: Aaron McCurry
>            Assignee: Michael McCandless
>             Fix For: 3.5
>
>         Attachments: RandomAccessTest.java, TermInfosReader.java, TermInfosReaderIndex.java,
>                      TermInfosReaderIndexDefault.java, TermInfosReaderIndexSmall.java, lowmemory_w_utf8_encoding.patch,
>                      lowmemory_w_utf8_encoding.v4.patch, patch-final.txt, rawoutput.txt
>
>
> Basically, this patch packs those three arrays into a single byte array, with an int array
> of offsets serving as the index.
> The performance benefits are staggering. On my test index (6.2 GB, with ~1,000,000
> documents and ~175,000,000 terms), the memory needed to load the terminfos was
> reduced to 17% of its original size, from 291.5 MB to 49.7 MB.  Random access speed
> improved by 1-2%, segment load time is ~40% faster, and full GCs on my JVM
> are 7 times faster.
> I have already done the work and am offering this code as a patch.  Currently all
> tests in trunk pass with the new code enabled.  I also added a system property switch
> so that the original implementation can still be used:
> -Dorg.apache.lucene.index.TermInfosReader=default or small
> I have also written a blog post about this patch; here is the link:
> http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html
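
The layout in the quoted description (one byte array for the packed entries plus an int
array of offsets) can be sketched roughly as below. This is not the code from the attached
patch: the class name PackedTermIndexSketch, the per-entry fields (term text, docFreq,
index pointer), and the fixed-width encoding are all assumptions chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the layout used by the actual patch.
class PackedTermIndexSketch {
    private final byte[] data;     // all entries packed back to back
    private final int[] offsets;   // offsets[i] = start of entry i in data
    private final ByteBuffer view; // read-only helper for absolute reads

    PackedTermIndexSketch(String[] terms, int[] docFreqs, long[] indexPointers) {
        int n = terms.length;
        byte[][] utf8 = new byte[n][];
        int total = 0;
        for (int i = 0; i < n; i++) {
            utf8[i] = terms[i].getBytes(StandardCharsets.UTF_8);
            total += 4 + utf8[i].length + 4 + 8; // length, term, docFreq, pointer
        }
        ByteBuffer buf = ByteBuffer.allocate(total);
        offsets = new int[n];
        for (int i = 0; i < n; i++) {
            offsets[i] = buf.position();
            buf.putInt(utf8[i].length).put(utf8[i]);
            buf.putInt(docFreqs[i]).putLong(indexPointers[i]);
        }
        data = buf.array();
        view = ByteBuffer.wrap(data).asReadOnlyBuffer();
    }

    int packedSize() { return data.length; }

    String term(int i) {
        int len = view.getInt(offsets[i]);
        return new String(data, offsets[i] + 4, len, StandardCharsets.UTF_8);
    }

    int docFreq(int i) {
        int len = view.getInt(offsets[i]);
        return view.getInt(offsets[i] + 4 + len);
    }

    long indexPointer(int i) {
        int len = view.getInt(offsets[i]);
        return view.getLong(offsets[i] + 4 + len + 4);
    }

    public static void main(String[] args) {
        PackedTermIndexSketch index = new PackedTermIndexSketch(
            new String[] {"apple", "banana", "cherry"},
            new int[] {3, 10, 1},
            new long[] {0L, 128L, 4096L});
        System.out.println(index.term(1) + " " + index.docFreq(1) + " " + index.indexPointer(1));
    }
}
```

Replacing three object arrays with two primitive arrays leaves the JVM far fewer objects
to allocate and trace, which is one plausible reading of the memory and GC gains reported
above.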


        


